Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for CRLF's #1

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

kyleder
Copy link

@kyleder kyleder commented Mar 5, 2016

In the current code, when a CRLF is encountered the result will be a LFLF. The Go CSV Reader will handle an unquoted LFLF by ignoring the empty line, but if the CRLF occurs inside of a quoted string then the LFLF will be maintained.

This change accounts for that by ignoring a CR that is followed by a LF. The resulting CRLF is properly handled by the CSV Reader which strips the preceding CR: "Carriage returns before newline characters are silently removed." (source).

If the next byte after a carriage return (CR) is a line feed (LF)
then don't replace the CR. The result will be a CRLF which is
supported by the Go CSV reader.
@@ -27,7 +27,7 @@ func New(r io.Reader) io.Reader {
func (r reader) Read(p []byte) (n int, err error) {
n, err = r.r.Read(p)
for i, b := range p {
if b == rByte {
if j := i + 1; b == rByte && ((j < len(p) && p[j] != nByte) || j == len(p)) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, I think this won't work in all cases. Read() is allowed to return less than len(p) bytes, even if it's not the end of the underlying Reader. So it can happen that a \r falls on the last byte of a Read() call, then the \n is the first byte of the next Read() call.

You can test this by creating your own Reader implementation to pass into this for testing which simulates this unfortunate case.

One fix is to require that this reader take a bufio.Reader as input rather than a io.Reader so that you can always call Peek() to get the next byte (I think that works).

Another fix could be to continue taking io.Reader and then test whether it can be cast in bufio.Reader and if not manually wrapping it.

@ctessum
Copy link
Owner

ctessum commented Mar 15, 2016

Sorry for the delayed response. Could you explain the motivation behind this PR? The original motivation for the package is that the current Mac version of Excel outputs CSV files that the standard library CSV package can't read. Is there a program that outputs CSV files that this PR can parse correctly but the master branch can't?

@MaerF0x0
Copy link

I'm encountering this now myself. The issue is if you do not know the exact newline char sequence your CSV has. For example in user generated content there could be basically any of (\r, \n, \r\n) as well you may or may not want to preserve the values within each field.

Eg:

a,b,c,d\r
a1,b1,"c1\r",d\r

==>

a,b,c,d\n
a1,b1,"c1\r",d\n

@Kaushal28
Copy link

@ctessum I don't think there is any issue in the master branch but when we use this reader, it'll create extra blank lines because of CRLF to LFLF replacements. However, I think the Go CSV reader ignores these empty lines as mentioned by the author of this PR (@aboodman) but it increments the line number, and hence in ParseError (refer: https://pkg.go.dev/encoding/csv#ParseError) it'd return incorrect line numbers. So for example, if any error occurs at actual line number 2, it'll think the error was at line 3 because of the black line added after line 1 replacement. So except for the first line, every line number would be incorrect because every line inserts a new line.

@aboodman, I was thinking about the same fix that you did but didn't know about the point you mentioned in the comment. So is there any workaround for that issue? How did you fix that? I'm not sure how to cast my io.Reader to bufio.Reader.

@ctessum
Copy link
Owner

ctessum commented Dec 22, 2021

So the original goal of this package was to be able to read CSV files created by the Mac version of Excel. I guess my question is: does the Mac version of Excel ever output CRLF's? If it does, then it would be within the scope of this package to handle them, but if not, then this change or any related change would be out of scope.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants