Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importer: fix scanning of newlines in enclosed fields for DELIMITED DATA #97545

Merged
merged 1 commit into from
Mar 1, 2023

Conversation

andyyang890
Copy link
Collaborator

@andyyang890 andyyang890 commented Feb 23, 2023

Previously, when using IMPORT INTO ... DELIMITED DATA, unescaped
newlines within quoted fields were treated as row terminators instead
of as part of the field. This patch refactors the scanning logic to
correctly interpret these newlines.

Fixes #95906

Release note (bug fix): IMPORT INTO ... DELIMITED DATA will now
correctly handle quoted fields that contain unescaped newlines.

@blathers-crl
Copy link

blathers-crl bot commented Feb 23, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andyyang890 andyyang890 force-pushed the import_delimited_data branch 6 times, most recently from 1e60ec9 to 6aa2c3a Compare February 27, 2023 21:59
@andyyang890 andyyang890 changed the title importer: fix parsing of newlines in enclosed fields for DELIMITED DATA importer: fix scanning of newlines in enclosed fields for DELIMITED DATA Feb 27, 2023
@andyyang890 andyyang890 marked this pull request as ready for review February 27, 2023 22:01
@andyyang890 andyyang890 requested review from a team and rafiss February 27, 2023 22:01
@andyyang890 andyyang890 force-pushed the import_delimited_data branch from 6aa2c3a to bec4f76 Compare February 27, 2023 23:49
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, glad this could get fixed! i had some small comments

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890)


pkg/sql/importer/read_import_mysqlout.go line 87 at r1 (raw file):

	row       []rune
	err       error
	eof       bool

nice cleanup! this eof field really seems needlessly complicated


pkg/sql/importer/read_import_mysqlout.go line 96 at r1 (raw file):

	var r rune
	var w int
	var gotEncloser, inEnclosedField, inEscapeSeq, inField bool

nit: could you add a small descriptive comment for each one of these flags? i think it will make the logic of the for loop easier to understand. specifically, the definitions of "encloser" and the difference between inEnclosedField vs inField would help.


pkg/sql/importer/read_import_mysqlout.go line 101 at r1 (raw file):

		r, w, d.err = d.reader.ReadRune()
		if d.err == io.EOF {
			d.err = nil

should we add the if nextLiteral check here and set ErrUnexpectedEOF if true?


pkg/sql/importer/read_import_mysqlout.go line 141 at r1 (raw file):

		}

		inField = !(r == d.opts.FieldSeparator)

nit: use r != d.opts.FieldSeparator

Previously, when using `IMPORT INTO ... DELIMITED DATA`, unescaped
newlines within quoted fields were treated as row terminators instead
of as part of the field. This patch refactors the scanning logic to
correctly interpret these newlines.

Release note (bug fix): `IMPORT INTO ... DELIMITED DATA` will now
correctly handle quoted fields that contain unescaped newlines.
@andyyang890 andyyang890 force-pushed the import_delimited_data branch from bec4f76 to 748a496 Compare February 28, 2023 22:25
Copy link
Collaborator Author

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/importer/read_import_mysqlout.go line 96 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: could you add a small descriptive comment for each one of these flags? i think it will make the logic of the for loop easier to understand. specifically, the definitions of "encloser" and the difference between inEnclosedField vs inField would help.

Done.


pkg/sql/importer/read_import_mysqlout.go line 101 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

should we add the if nextLiteral check here and set ErrUnexpectedEOF if true?

I think that would actually be incorrect since it would disallow ending a file with the escape character. From my reading of the logic before my change, the check here would never actually be reached since it would mean d.row == nil (i.e. the row is empty) yet nextLiteral is true (i.e. the row had an escape character), which is why I deleted it.


pkg/sql/importer/read_import_mysqlout.go line 141 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: use r != d.opts.FieldSeparator

Done.

@andyyang890 andyyang890 requested a review from rafiss March 1, 2023 18:05
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, the comments help me understand better now. i just had one request for another test, and should be good to merge after that

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890)


pkg/sql/importer/read_import_mysqlout.go line 101 at r1 (raw file):

I think that would actually be incorrect since it would disallow ending a file with the escape character

hm, but it seems correct to disallow ending a file with the escape character. at least, no matter what we decide, let's have a test case for that situation.

Copy link
Collaborator Author

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/importer/read_import_mysqlout.go line 101 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

I think that would actually be incorrect since it would disallow ending a file with the escape character

hm, but it seems correct to disallow ending a file with the escape character. at least, no matter what we decide, let's have a test case for that situation.

Oh you're right, this is indeed an error case and there's already a test for it here:

{
name: "unmatched literal",
create: `i int8`,
with: `WITH fields_escaped_by = '\'`,
typ: "DELIMITED",
data: `\`,
err: "row 1: unmatched literal",
rejected: "\\\n",
query: map[string][][]string{`SELECT * from t`: {}},
},

I believe the check for this is deferred to FillDatums to return a more informative error message (i.e. unmatched literal).

@andyyang890 andyyang890 requested a review from rafiss March 1, 2023 22:32
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for checking! lgtm

@andyyang890
Copy link
Collaborator Author

TFTR!

bors r=rafiss

@craig
Copy link
Contributor

craig bot commented Mar 1, 2023

Build succeeded:

@craig craig bot merged commit 65752af into cockroachdb:master Mar 1, 2023
@andyyang890 andyyang890 deleted the import_delimited_data branch March 2, 2023 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: IMPORT INTO does not properly handle newline terminated text in a delimited CSV
3 participants