Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: NFC-normalization on double-quoted identifiers #71140

Merged
merged 1 commit into from
Oct 12, 2021

Conversation

e-mbrown
Copy link
Contributor

@e-mbrown e-mbrown commented Oct 5, 2021

Resolves: #55396

There was a lack of NFC-normalization on double-quoted
identifiers. This allowed for ambiguous table/db/column names.
This change adds normalization and prevents this case.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@e-mbrown e-mbrown requested a review from rafiss October 5, 2021 17:30
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown and @rafiss)


pkg/sql/normalization_test.go, line 39 at r1 (raw file):

		t.Fatal(err)
	}
	defer db.Close()

nit: you can simplify the new test cluster and pgurl stuff with:

	s, db, _ := serverutils.StartServer(t, base.TestServerArgs{Insecure: true})
	defer s.Stopper().Stop(context.Background())
	defer db.Close()

pkg/sql/normalization_test.go, line 41 at r1 (raw file):

	defer db.Close()

	_, err = db.Exec("CREATE TABLE \"" + string("caf\u00E9") + "\" (a INT)")

nit: use

"CREATE TABLE \"caf\u00E9\" (a INT)"

pkg/sql/normalization_test.go, line 44 at r1 (raw file):

	require.NoError(t, err)

	_, err = db.Exec("CREATE TABLE \"" + string("cafe\u0301") + "\" (a INT)")

nit: use

"CREATE TABLE \"cafe\u0301\" (a INT)"

pkg/sql/normalization_test.go, line 45 at r1 (raw file):

	_, err = db.Exec("CREATE TABLE \"" + string("cafe\u0301") + "\" (a INT)")
	require.Errorf(t, err, "The tables should be considered duplicates when normalized")

let's also add a check for the error message:

require.True(t, strings.Contains(err.Error(), "already exists")

actually in addition, let's make the test create in this order:

	_, err = db.Exec("CREATE TABLE café (a INT)")
	_, err = db.Exec("CREATE TABLE cafe\u0301 (a INT)")
	_, err = db.Exec("CREATE TABLE caf\u00E9 (a INT)")
	_, err = db.Exec("CREATE TABLE \"caf\u00E9\" (a INT)")
	_, err = db.Exec("CREATE TABLE \"cafe\u0301\" (a INT)")

the first one should work, and all the other ones should get the error.


pkg/sql/normalization_test.go, line 46 at r1 (raw file):

	_, err = db.Exec("CREATE TABLE \"" + string("cafe\u0301") + "\" (a INT)")
	require.Errorf(t, err, "The tables should be considered duplicates when normalized")
}

i think we should add another test here:

	var b bool
	err = db.QueryRow("SELECT 'caf\u00E9' = 'cafe\u0301'").Scan(&b)
	require.NoError(t, err)
	require.False(t, b)

this will make sure that normal strings are not normalized


pkg/sql/scanner/scan.go, line 920 at r1 (raw file):

	}

	lval.SetStr(s.finishString(lexbase.NormalizeBytes(buf)))

i think this might be normalizing too many inputs -- it will apply to normal strings as well as identifiers. i think we need this kind of logic limited to only be in the identQuote case here

if s.scanString(lval, identQuote, false /* allowEscapes */, true /* requireUTF8 */) {

also i noticed one other existing bug which i think we'll need to fix for this to work. this line should say if isLower && isASCII:

if isLower {

Copy link
Contributor Author

@e-mbrown e-mbrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/normalization_test.go, line 45 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

let's also add a check for the error message:

require.True(t, strings.Contains(err.Error(), "already exists")

actually in addition, let's make the test create in this order:

	_, err = db.Exec("CREATE TABLE café (a INT)")
	_, err = db.Exec("CREATE TABLE cafe\u0301 (a INT)")
	_, err = db.Exec("CREATE TABLE caf\u00E9 (a INT)")
	_, err = db.Exec("CREATE TABLE \"caf\u00E9\" (a INT)")
	_, err = db.Exec("CREATE TABLE \"cafe\u0301\" (a INT)")

the first one should work, and all the other ones should get the error.

Done.


pkg/sql/normalization_test.go, line 46 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i think we should add another test here:

	var b bool
	err = db.QueryRow("SELECT 'caf\u00E9' = 'cafe\u0301'").Scan(&b)
	require.NoError(t, err)
	require.False(t, b)

this will make sure that normal strings are not normalized

Done.


pkg/sql/scanner/scan.go, line 920 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i think this might be normalizing too many inputs -- it will apply to normal strings as well as identifiers. i think we need this kind of logic limited to only be in the identQuote case here

if s.scanString(lval, identQuote, false /* allowEscapes */, true /* requireUTF8 */) {

also i noticed one other existing bug which i think we'll need to fix for this to work. this line should say if isLower && isASCII:

if isLower {

Done.

@e-mbrown e-mbrown marked this pull request as ready for review October 6, 2021 16:56
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown and @rafiss)


pkg/sql/lexbase/normalize.go, line 57 at r2 (raw file):

}

//NormalizeBytes normalizes to lowercase and Unicode Normalization

nit: this doesn't turn things into lower case. actually, maybe we should add a test for that too by adding to the test so it creates:

  • table "Café" (with quotes) which should not fail
  • table Café (without quotes) which should fail

pkg/sql/lexbase/normalize.go, line 61 at r2 (raw file):

//identifiers.
func NormalizeBytes(n []byte) []byte {
	return norm.NFC.Bytes(n)

do you know why we need NFC.bytes here? why not NFC.String, like the function above?

@e-mbrown
Copy link
Contributor Author

e-mbrown commented Oct 6, 2021


pkg/sql/lexbase/normalize.go, line 61 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

do you know why we need NFC.bytes here? why not NFC.String, like the function above?

We use NFC.bytes here because in scanString(scan.go line 829), buf isn't converted into a string until the end of the function using s.finishString. A lot of the other functions use s.finishString so I thought targeting buf beforehand was a better option

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown and @rafiss)


pkg/sql/normalization_test.go, line 35 at r3 (raw file):

	require.NoError(t, err)

	_, err = db.Exec(`CREATE TABLE "Café" (a INT)`)

super tiny nit: but could you move the "Café" test so it's the last one? that way we can be sure that the Café test is conflicting with the table we expect it to


pkg/sql/lexbase/normalize.go, line 61 at r2 (raw file):

Previously, e-mbrown wrote…

We use NFC.bytes here because in scanString(scan.go line 829), buf isn't converted into a string until the end of the function using s.finishString. A lot of the other functions use s.finishString so I thought targeting buf beforehand was a better option

ah it makes sense. i think we can make it use NFC.String though. that way we can add the optimization from the above function:

func NormalizeString(s string) string {
	if isASCII(s) {
		return s
	}
	return norm.NFC.String(s)

(we just don't want the convert to lower-case step)

then we change the call from scan.go so it's:

lval.SetStr(lexbase.NormalizeString(s.finishString(buf))

Copy link
Contributor Author

@e-mbrown e-mbrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/normalization_test.go, line 35 at r3 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

super tiny nit: but could you move the "Café" test so it's the last one? that way we can be sure that the Café test is conflicting with the table we expect it to

Done.


pkg/sql/lexbase/normalize.go, line 61 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

ah it makes sense. i think we can make it use NFC.String though. that way we can add the optimization from the above function:

func NormalizeString(s string) string {
	if isASCII(s) {
		return s
	}
	return norm.NFC.String(s)

(we just don't want the convert to lower-case step)

then we change the call from scan.go so it's:

lval.SetStr(lexbase.NormalizeString(s.finishString(buf))

Done.

@rafiss
Copy link
Collaborator

rafiss commented Oct 8, 2021

the failing test is

# Unicode sequences are preserved.

statement ok
CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)

The two column names in that test have different unicode representations that normalize to the same thing, so now it fails with (42701) relation "amelie" (61): duplicate column name: "Amélie". @knz it seems that you added the test long ago, but later on you filed #55396

Am I correct to conclude that you no longer want that "Amélie" test to exist in its current form?

@knz
Copy link
Contributor

knz commented Oct 11, 2021

Am I correct to conclude that you no longer want that "Amélie" test to exist in its current form?

I'd say we need to change the text to expect a duplicate column definition (i.e. check that the normalization occurs)

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown and @rafiss)


pkg/sql/lexbase/normalize.go, line 57 at r4 (raw file):

}

//NormalizeString normalizes to Unicode Normalization Form C (NFC).

nit: 1 space after //

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown, @knz, and @rafiss)


pkg/ccl/logictestccl/testdata/logic_test/case_sensitive_names, line 20 at r5 (raw file):

  PARTITION p1 VALUES IN ((1)),
  PARTITION "P1" VALUES IN ((2)),
  PARTITION "Amélie" VALUES IN ((3))

i don't think we want to change this

but in pkg/sql/logictest/testdata/logic_test/case_sensitive_names we need to change these tests.

statement ok
CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)

# Check that the column names were encoded properly
query I
SELECT ordinal_position FROM information_schema.columns WHERE table_name = 'amelie' AND column_name::BYTES = b'Ame\xcc\x81lie'
----
1

query I
SELECT ordinal_position FROM information_schema.columns WHERE table_name = 'amelie' AND column_name::BYTES = b'Am\xc3\xa9lie'
----
2

# Check that the non-normalized names propagate throughout until results.

query II colnames
SELECT "Amélie", "Amélie" FROM Amelie
----
Amélie Amélie
2      1

the first one should become:

statement error duplicate column name: "Amélie"
CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)

and we can remove the other tests.

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown, @knz, and @rafiss)


pkg/ccl/logictestccl/testdata/logic_test/case_sensitive_names, line 20 at r5 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i don't think we want to change this

but in pkg/sql/logictest/testdata/logic_test/case_sensitive_names we need to change these tests.

statement ok
CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)

# Check that the column names were encoded properly
query I
SELECT ordinal_position FROM information_schema.columns WHERE table_name = 'amelie' AND column_name::BYTES = b'Ame\xcc\x81lie'
----
1

query I
SELECT ordinal_position FROM information_schema.columns WHERE table_name = 'amelie' AND column_name::BYTES = b'Am\xc3\xa9lie'
----
2

# Check that the non-normalized names propagate throughout until results.

query II colnames
SELECT "Amélie", "Amélie" FROM Amelie
----
Amélie Amélie
2      1

the first one should become:

statement error duplicate column name: "Amélie"
CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)

and we can remove the other tests.

oh my bad, you can keep the change in this file too. didn't see that it was a change in the expected result

There was a lack of NFC-normalization on double-quoted
identifiers. This allowed for ambiguous table/db/column names.
This change adds normalization and prevents this case.

Release note: None
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! feel free to merge once CI passes

@e-mbrown
Copy link
Contributor Author

bors r=rafiss

@craig
Copy link
Contributor

craig bot commented Oct 11, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Oct 12, 2021

Build succeeded:

@craig craig bot merged commit 6159a73 into cockroachdb:master Oct 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql/lex: quoted SQL identifiers are not NFC-normalized
4 participants