-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
encoding/csv: stop preserving \r in multiline quoted strings #22746
Comments
This is the most stupid thing I've heard in my entire life and it is nothing more than stubborn refusal to accept you made a mistake. This behavior makes the Go CSV parser MODIFY data that are in the middle of a string! The standard probably does not say anything about preservibg Swedish character Ö either so why not shorten it to O following this ridiculous nonsense. For me however I'm now using libcsv which is about 7x more performant and does not corrupt data so I don't care personally anymore. Golang is dead for me. |
It appears the text and the BNF differ. The text says:
This example seems to imply that a line break is CRLF. The BNF states:
Expanded, TEXTDATA includes CR, LF, and all printable characters (0x20-0x73) excluding " unless it is doubled. TEXTDATA may contain a lone CR or a lone LF. The RFC does not cover two types of data: binary and UNICODE. A strict reading of the RFC limits a string to all printable ASCII characters plus a space, linefeed, and carriage return. The csv package is a superset of the RFC (and I think it needs to be, e.g., it should allow UNICODE). I do not believe either interpretation is wrong and it ends up being a choice of convention. I agree with Russ that man programs/operating systems feel quite free to convert CRLF to LF and a lone LF to CRLF. If an operating system or program does this then it is quite possible the text seen by the CSV parser is not exactly the original text. The main problem as I see it is aside from RFC 4180 there is no real CSV standard. Many common use cases do not follow RFC 4180. The csv package tries hard to both honor 4180 as well as accept many common use cases (is the space after a comma part of the field or part of the separator, can a quote exist in an unquoted field, must the number of fields be the same for every record). I seem to recall that Excel's CSV format was one of the variants beyond RFC 4180. The existing package does documentation does describe the default behavior. It is unfortunate there is more than one "standard" end of line sequence for text files (LF, CRLF, and I believe I have seen CR as well). Some programs/operating systems distinguish between text and binary files (FTP does, for example). In my previous comment I was erring on the side of the BNF. Russ errs on the side of principle of least surprise to most people. I believe both are reasonable positions to take, but perhaps the principle of least surprise in this case, due to the ambiguity of end of line sequences, is a better choice than a pedantic reading of the BNF. |
We handle customer data that may come in CSV format and to corrupt customer data like this should be a straight execution penalty! |
Alex, can you point to the CSV format description that your files are adhering to? I am honestly interested since back when I wrote the package I looked for every definition I could find and tried to address as many of them as possible (hence the options). Some text editors on Windows/MS-DOS will "corrupt customer data" simply by reading in and writing out a text file (they change \n to \r\n). Can you provide more detail on what type of data this is (is it even ASCII/Unicode?) and why equating \r\n with \n corrupts the meaning of the data? If there is a different package that satisfies your requirements then that is the right one to use. I do not doubt it is more performant. The standard csv package is quite old and predates Go 1.0. I would likely write it differently today. |
I don't care what Microsoft do or don't, they are incompetent people. I would never intentionally corrupt customer data like this and blame some paragraph jerkfest as the reason. This kind of reasonging is the reasoning of someone with pants up their armpits and star wars socks. A besserwisser. You can either do the obvious or you can argue current behavior is better, for me I don't care anymore I will never use such broken low quality garbage ever again. Corrupt customer data! Pfft!! |
@alexhultman. I understand if you disagree. This issue would benefit more from a civil discussion on what the right behavior is. Please see https://golang.org/conduct The idea to strip "\r" is not unprecedented. For example, in Javascript template literals, "\r\n" is always converted to "\n". As for the behavior of |
For what it's worth, as a user this seems like a strange change. CSV is usually used as a format for passing around data and it seems weird to start messing around with the data itself. I don't think it matters that "Everything else in Go uses \n as the line separator." This is just data passing through Go programs, not some aspect of Go itself. |
@bourbonspecial, suppose we read this file with two records:
Everyone seems to agree with (at least no one is making an argument against) reporting the third field in both records as "c" not "c\r". Why then should we report the middle field of the second record as "b1\r\nb2" instead of "b1\nb2"? The quotes are there to allow the field to span multiple lines, not to say "all of a sudden s are important right here but nowhere else in the file". I am genuinely interested to hear of a setting where stripping the on lines 1 and 3 of that file is correct and appropriate and desirable but stripping the on line 2 is incorrect, inappropriate, and undesirable. As I said above, I can't think of any. There is also much weight to be given to preserving existing behavior. Inevitably every behavioral change, even when everyone agrees the old behavior was a mistake, breaks some users. If we can help them see that the old behavior was wrong, that goes a long way to making clear that we're not just breaking things without considering the consequences. Here I think many users will be broken who actually do not want this new behavior and will not agree that the new behavior is desirable. Hence my interest in hearing about concrete instances when a user would want the new behavior. But I think I'm beating a dead horse at this point. Without a compelling example of why preserving this one in that file is important, the scales clearly tip in favor of preserving existing behavior and therefore not breaking existing uses. I will send a CL. |
Change https://golang.org/cl/78295 mentions this issue: |
CL sent. I also noticed while working on the CL that the earlier change made Reader incompatible with Writer. If you are writing with a Writer with UseCRLF set to true, it encodes "a\nb" as "a\r\nb", so that UseCRLF means every line ending in the output is CRLF. Reader correctly undid that in Go 1.9 so that data round-tripped. Reverting the earlier change preserves that property. |
@rsc suppose you're processing some data that includes the following within a field "hello\r\nfriend", how would I preserve that sequence of characters if I have to read it through and write it out of the CSV package? In your example, the difference is that lines 1 and 3 \r\n denotes the end of a record whereas in line 2 \r\n is data in an escaped field. The RFC says you're allowed to include a certain set of characters (including \r) in a field so I think stripping characters out of a field is a very surprising behaviour. |
@bourbonspecial the issue is that CSV files are text files. Text files are not binary. The two text files "foo\r\nbar" and "foo\nbar" are logically equivalent even though they are different from a binary perspective. Another example, the following two one line CSV files are equivalent:
but they physically differ. I personally do not believe there is a single correct answer to how to deal with carriage returns or newlines embedded in a string. We could have a "preserve carriage returns in fields" option, but that doesn't deal with other text handling code in a system that might canonicalize line endings in text files (FTP ASCII mode transfers are one such system). We have had one very loud complaint about end-of-line handling by the csv package, but very little information on the specifics of why this was a problem, if the problem was in fact other software that was fragile, or even what kind of data it was. If the data was binary then I would suggest that using CSV is a poor choice and the file format is one that makes a different compromise. We have a concern that the change of preserving carriage returns will cause existing programs to break, which violates the backwards compatibility guarantee of Go 1. I accept that I made a mistake when I said “You have pointed out a situation in which the CSV package is not following the RFC.” I now believe the RFC is ambiguous on this point. The prose and the BNF are in conflict. RFC4180 is in the context of MIME, and the file is a series of lines of text. A CSV field can span multiple lines of text. Lines of text do not contain CRLF or bare LF sequences. I can still make the argument for preserving CR's in fields, but absence any other information than what has been presented, the argument that they are lines of text and preserving the backwards compatibility guarantee tilts the resolution in favor of existing behavior. |
@pborman fair enough. My as-a-humble-user point of view is that there's some thought involved in "foo\r\nbar" == "foo\nbar", and if there's an opportunity to distance yourself from worrying about how other people might interpret those characters then it seems safer to not get involved. Totally accept that the risk of breaking existing programs is ever present so it's a trade off. |
@pborman FWIW I don't even think the BNF conflicts with any of this. The BNF just says that in order to be a valid CSV, any mid-value CR or LF bytes must be quoted. It does not say how they must be interpreted. |
Assertions My 2p The RFC detail for implementors about CRLF for line breaks being open to different line breaks is about line breaks delimiting records, it is not about escaped data. So I would expect:
I think a parser so generated would leave an escaped field containing \r\n unmodified. When dealing with end users the de-facto standard for CSV is not the IETF RFC it is 'whatever Excel does'. I think (hope) most developers who have a need to use CSV wish to do so because their end users are using it in Excel or they are doing some database ETL process (in which case you still want the escaped \r\n preserved). |
I trying to use Go for my csv processing tool. |
@noangel This issue is closed. We don't use issues for discussion. Please see https://golang.org/wiki/Questions . Thanks. The short answer to your question is that it turns out that there are hundreds of variants of CSV and there are hundreds of different ways that they are used. Supporting them all in the standard library would mean a package that is very difficult to understand and use correctly. So we've decided going forward to restrict the package only to the format described in RFC 4180. We understand that this means that the package can't work for everyone. The code is there and available for modification for your specific use case. |
See cockroachdb#25344. It appears this is being caused by the behaviour of Golang's encoding/csv library, which folds \r\n into \n when reading. This was fixed in golang/go#21201 but then reverted golang/go#22746. It appears based on that second issue that Go is unlikely to change that behavior. Check in the stdlib `encoding/csv` into `pkg/util` with golang/go#22746 reverted. Release note: `\r\n` characters in CSV files were silently converted into `\n`. This causes imported data to be different. This is now fixed.
See cockroachdb#25344. It appears this is being caused by the behaviour of Golang's encoding/csv library, which folds \r\n into \n when reading. This was fixed in golang/go#21201 but then reverted golang/go#22746. It appears based on that second issue that Go is unlikely to change that behavior. Check in the stdlib `encoding/csv` into `pkg/util` with golang/go#22746 reverted. Release note: `\r\n` characters in CSV files were silently converted into `\n`. This causes imported data to be different. This is now fixed.
See cockroachdb#25344. It appears this is being caused by the behaviour of Golang's encoding/csv library, which folds \r\n into \n when reading. This was fixed in golang/go#21201 but then reverted golang/go#22746. It appears based on that second issue that Go is unlikely to change that behavior. Check in the stdlib `encoding/csv` into `pkg/util` with golang/go#22746 reverted. Release note: `\r\n` characters in CSV files were silently converted into `\n`. This causes imported data to be different. This is now fixed.
In #21201, @alexhultman filed a fairly blunt bug report that “I want the verbatim data as in the CSV file, not some made up corrupt data.” because the CSV reader changes \r\n to \n inside multiline quoted strings.
@nussjustin said on the issue that this behavior was unexpected, and @pborman seemed to agree that it was wrong, but I cannot figure out why. Later in the issue @pborman says “You have pointed out a situation in which the CSV package is not following the RFC.” but I can't see anything in the RFC that says \r must be preserved in this case.
About carriage returns the RFC says (in total!):
“Each record is located on a separate line, delimited by a line break (CRLF).”
“The last record in the file may or may not have an ending line break.”
“Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.”
“As per section 4.1.1. of RFC 2046 [3], this media type uses CRLF to denote line breaks. However, implementors should be aware that some implementations may use other values.”
The entire ABNF grammar is:
Nothing in the RFC text or the grammar says that CRLF inside a quoted string must not be shortened to LF. What the RFC does say is that CRLF is the way line breaks are denoted and that some implementations use other ones. Consistent with that, the Go library rewrites all CRLF to just LF, because that is the standard form inside Go programs.
@alexhultman didn't give a rationale for keeping the \r, and I can't think of one.
Everything else in Go uses \n as the line separator. It makes sense to treat inputs here using \r\n as the line separator as an external encoding to be converted and to apply the \r\n → \n conversion to all lines, including quoted strings. CSV files are a special format text file. If you "convert" a CSV file from Unix line endings to Windows line endings, a convenient property would be that reading the CSV data produces the same result for both versions of the file. Go 1.9 and earlier provided that. We do the same processing for \r inside multiline backquote strings in Go source code, for the same reason.
I believe that if we keep the change in CL 52810, this will surprise users who have CSV data from some generator that uses \r\n line endings and now have unexpected \r characters in their multiline values that they will need to postprocess away.
I think we should revert this CL.
The text was updated successfully, but these errors were encountered: