sql: NFC-normalization on double-quoted identifiers #71140

e-mbrown · 2021-10-05T15:37:48Z

Resolves: #55396

There was a lack of NFC-normalization on double-quoted
identifiers. This allowed for ambiguous table/db/column names.
This change adds normalization and prevents this case.

Release note: None

cockroach-teamcity · 2021-10-05T15:37:55Z

This change is

rafiss

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown and @rafiss)

pkg/sql/normalization_test.go, line 39 at r1 (raw file):

		t.Fatal(err)
	}
	defer db.Close()

nit: you can simplify the new test cluster and pgurl stuff with:

	s, db, _ := serverutils.StartServer(t, base.TestServerArgs{Insecure: true})
	defer s.Stopper().Stop(context.Background())
	defer db.Close()

pkg/sql/normalization_test.go, line 41 at r1 (raw file):

	defer db.Close()

	_, err = db.Exec("CREATE TABLE \"" + string("caf\u00E9") + "\" (a INT)")

nit: use

"CREATE TABLE \"caf\u00E9\" (a INT)"

pkg/sql/normalization_test.go, line 44 at r1 (raw file):

	require.NoError(t, err)

	_, err = db.Exec("CREATE TABLE \"" + string("cafe\u0301") + "\" (a INT)")

nit: use

"CREATE TABLE \"cafe\u0301\" (a INT)"

pkg/sql/normalization_test.go, line 45 at r1 (raw file):

	_, err = db.Exec("CREATE TABLE \"" + string("cafe\u0301") + "\" (a INT)")
	require.Errorf(t, err, "The tables should be considered duplicates when normalized")

let's also add a check for the error message:

require.True(t, strings.Contains(err.Error(), "already exists")

actually in addition, let's make the test create in this order:

	_, err = db.Exec("CREATE TABLE café (a INT)")
	_, err = db.Exec("CREATE TABLE cafe\u0301 (a INT)")
	_, err = db.Exec("CREATE TABLE caf\u00E9 (a INT)")
	_, err = db.Exec("CREATE TABLE \"caf\u00E9\" (a INT)")
	_, err = db.Exec("CREATE TABLE \"cafe\u0301\" (a INT)")

the first one should work, and all the other ones should get the error.

pkg/sql/normalization_test.go, line 46 at r1 (raw file):

	_, err = db.Exec("CREATE TABLE \"" + string("cafe\u0301") + "\" (a INT)")
	require.Errorf(t, err, "The tables should be considered duplicates when normalized")
}

i think we should add another test here:

	var b bool
	err = db.QueryRow("SELECT 'caf\u00E9' = 'cafe\u0301'").Scan(&b)
	require.NoError(t, err)
	require.False(t, b)

this will make sure that normal strings are not normalized

pkg/sql/scanner/scan.go, line 920 at r1 (raw file):

	}

	lval.SetStr(s.finishString(lexbase.NormalizeBytes(buf)))

i think this might be normalizing too many inputs -- it will apply to normal strings as well as identifiers. i think we need this kind of logic limited to only be in the identQuote case here

cockroach/pkg/sql/scanner/scan.go

Line 153 in f14a123

    
           if s.scanString(lval, identQuote, false /* allowEscapes */, true /* requireUTF8 */) {

also i noticed one other existing bug which i think we'll need to fix for this to work. this line should say if isLower && isASCII:

cockroach/pkg/sql/scanner/scan.go

Line 566 in f14a123

if isLower {

e-mbrown

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)

pkg/sql/normalization_test.go, line 45 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

let's also add a check for the error message:
require.True(t, strings.Contains(err.Error(), "already exists")
actually in addition, let's make the test create in this order:
	_, err = db.Exec("CREATE TABLE café (a INT)")
	_, err = db.Exec("CREATE TABLE cafe\u0301 (a INT)")
	_, err = db.Exec("CREATE TABLE caf\u00E9 (a INT)")
	_, err = db.Exec("CREATE TABLE \"caf\u00E9\" (a INT)")
	_, err = db.Exec("CREATE TABLE \"cafe\u0301\" (a INT)")
the first one should work, and all the other ones should get the error.

Done.

pkg/sql/normalization_test.go, line 46 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i think we should add another test here:
	var b bool
	err = db.QueryRow("SELECT 'caf\u00E9' = 'cafe\u0301'").Scan(&b)
	require.NoError(t, err)
	require.False(t, b)
this will make sure that normal strings are not normalized

Done.

pkg/sql/scanner/scan.go, line 920 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i think this might be normalizing too many inputs -- it will apply to normal strings as well as identifiers. i think we need this kind of logic limited to only be in the identQuote case here

cockroach/pkg/sql/scanner/scan.go

Line 153 in f14a123

if s.scanString(lval, identQuote, false /* allowEscapes */, true /* requireUTF8 */) {

also i noticed one other existing bug which i think we'll need to fix for this to work. this line should say if isLower && isASCII:

cockroach/pkg/sql/scanner/scan.go

Line 566 in f14a123

if isLower {

Done.

rafiss

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown and @rafiss)

pkg/sql/lexbase/normalize.go, line 57 at r2 (raw file):

}

//NormalizeBytes normalizes to lowercase and Unicode Normalization

nit: this doesn't turn things into lower case. actually, maybe we should add a test for that too by adding to the test so it creates:

table "Café" (with quotes) which should not fail
table Café (without quotes) which should fail

pkg/sql/lexbase/normalize.go, line 61 at r2 (raw file):

//identifiers.
func NormalizeBytes(n []byte) []byte {
	return norm.NFC.Bytes(n)

do you know why we need NFC.bytes here? why not NFC.String, like the function above?

e-mbrown · 2021-10-06T18:17:11Z

pkg/sql/lexbase/normalize.go, line 61 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

do you know why we need NFC.bytes here? why not NFC.String, like the function above?

We use NFC.bytes here because in scanString(scan.go line 829), buf isn't converted into a string until the end of the function using s.finishString. A lot of the other functions use s.finishString so I thought targeting buf beforehand was a better option

rafiss

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown and @rafiss)

pkg/sql/normalization_test.go, line 35 at r3 (raw file):

	require.NoError(t, err)

	_, err = db.Exec(`CREATE TABLE "Café" (a INT)`)

super tiny nit: but could you move the "Café" test so it's the last one? that way we can be sure that the Café test is conflicting with the table we expect it to

pkg/sql/lexbase/normalize.go, line 61 at r2 (raw file):

Previously, e-mbrown wrote…

We use NFC.bytes here because in scanString(scan.go line 829), buf isn't converted into a string until the end of the function using s.finishString. A lot of the other functions use s.finishString so I thought targeting buf beforehand was a better option

ah it makes sense. i think we can make it use NFC.String though. that way we can add the optimization from the above function:

func NormalizeString(s string) string {
	if isASCII(s) {
		return s
	}
	return norm.NFC.String(s)

(we just don't want the convert to lower-case step)

then we change the call from scan.go so it's:

lval.SetStr(lexbase.NormalizeString(s.finishString(buf))

e-mbrown

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)

pkg/sql/normalization_test.go, line 35 at r3 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

super tiny nit: but could you move the "Café" test so it's the last one? that way we can be sure that the Café test is conflicting with the table we expect it to

Done.

pkg/sql/lexbase/normalize.go, line 61 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

ah it makes sense. i think we can make it use NFC.String though. that way we can add the optimization from the above function:
func NormalizeString(s string) string {
	if isASCII(s) {
		return s
	}
	return norm.NFC.String(s)
(we just don't want the convert to lower-case step)

then we change the call from scan.go so it's:
lval.SetStr(lexbase.NormalizeString(s.finishString(buf))

Done.

rafiss · 2021-10-08T21:53:41Z

the failing test is

# Unicode sequences are preserved.

statement ok
CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)

The two column names in that test have different unicode representations that normalize to the same thing, so now it fails with (42701) relation "amelie" (61): duplicate column name: "Amélie". @knz it seems that you added the test long ago, but later on you filed #55396

Am I correct to conclude that you no longer want that "Amélie" test to exist in its current form?

knz · 2021-10-11T04:18:17Z

Am I correct to conclude that you no longer want that "Amélie" test to exist in its current form?

I'd say we need to change the text to expect a duplicate column definition (i.e. check that the normalization occurs)

knz

Reviewed 3 of 3 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown and @rafiss)

pkg/sql/lexbase/normalize.go, line 57 at r4 (raw file):

}

//NormalizeString normalizes to Unicode Normalization Form C (NFC).

nit: 1 space after //

rafiss

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown, @knz, and @rafiss)

pkg/ccl/logictestccl/testdata/logic_test/case_sensitive_names, line 20 at r5 (raw file):

  PARTITION p1 VALUES IN ((1)),
  PARTITION "P1" VALUES IN ((2)),
  PARTITION "Amélie" VALUES IN ((3))

i don't think we want to change this

but in pkg/sql/logictest/testdata/logic_test/case_sensitive_names we need to change these tests.

statement ok
CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)

# Check that the column names were encoded properly
query I
SELECT ordinal_position FROM information_schema.columns WHERE table_name = 'amelie' AND column_name::BYTES = b'Ame\xcc\x81lie'
----
1

query I
SELECT ordinal_position FROM information_schema.columns WHERE table_name = 'amelie' AND column_name::BYTES = b'Am\xc3\xa9lie'
----
2

# Check that the non-normalized names propagate throughout until results.

query II colnames
SELECT "Amélie", "Amélie" FROM Amelie
----
Amélie Amélie
2      1

the first one should become:

statement error duplicate column name: "Amélie"
CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)

and we can remove the other tests.

rafiss

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown, @knz, and @rafiss)

pkg/ccl/logictestccl/testdata/logic_test/case_sensitive_names, line 20 at r5 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i don't think we want to change this

but in pkg/sql/logictest/testdata/logic_test/case_sensitive_names we need to change these tests.

statement ok
CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)

# Check that the column names were encoded properly
query I
SELECT ordinal_position FROM information_schema.columns WHERE table_name = 'amelie' AND column_name::BYTES = b'Ame\xcc\x81lie'
----
1

query I
SELECT ordinal_position FROM information_schema.columns WHERE table_name = 'amelie' AND column_name::BYTES = b'Am\xc3\xa9lie'
----
2

# Check that the non-normalized names propagate throughout until results.

query II colnames
SELECT "Amélie", "Amélie" FROM Amelie
----
Amélie Amélie
2      1

the first one should become:

statement error duplicate column name: "Amélie"
CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)

and we can remove the other tests.

oh my bad, you can keep the change in this file too. didn't see that it was a change in the expected result

There was a lack of NFC-normalization on double-quoted identifiers. This allowed for ambiguous table/db/column names. This change adds normalization and prevents this case. Release note: None

rafiss

lgtm! feel free to merge once CI passes

e-mbrown · 2021-10-11T19:28:13Z

bors r=rafiss

craig · 2021-10-11T22:19:06Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2021-10-12T01:35:05Z

Build succeeded:

GitHub CI (Cockroach)

e-mbrown requested a review from rafiss October 5, 2021 17:30

rafiss requested changes Oct 5, 2021

View reviewed changes

e-mbrown force-pushed the eb/rolespec branch from f8a3c59 to 9c7493d Compare October 6, 2021 16:56

e-mbrown commented Oct 6, 2021

View reviewed changes

e-mbrown marked this pull request as ready for review October 6, 2021 16:56

rafiss requested changes Oct 6, 2021

View reviewed changes

e-mbrown force-pushed the eb/rolespec branch from 9c7493d to 99659d8 Compare October 6, 2021 18:23

rafiss requested changes Oct 6, 2021

View reviewed changes

e-mbrown force-pushed the eb/rolespec branch from 99659d8 to 1c56cef Compare October 6, 2021 20:39

e-mbrown commented Oct 6, 2021

View reviewed changes

knz reviewed Oct 11, 2021

View reviewed changes

e-mbrown force-pushed the eb/rolespec branch from 1c56cef to 5f50285 Compare October 11, 2021 13:44

rafiss requested changes Oct 11, 2021

View reviewed changes

e-mbrown force-pushed the eb/rolespec branch from 5f50285 to 8d91c2c Compare October 11, 2021 17:26

sql: NFC-normalization on double-quoted identifiers

808ef25

There was a lack of NFC-normalization on double-quoted identifiers. This allowed for ambiguous table/db/column names. This change adds normalization and prevents this case. Release note: None

e-mbrown force-pushed the eb/rolespec branch from 8d91c2c to 808ef25 Compare October 11, 2021 18:28

rafiss approved these changes Oct 11, 2021

View reviewed changes

craig bot merged commit 6159a73 into cockroachdb:master Oct 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: NFC-normalization on double-quoted identifiers #71140

sql: NFC-normalization on double-quoted identifiers #71140

e-mbrown commented Oct 5, 2021

cockroach-teamcity commented Oct 5, 2021

rafiss left a comment

e-mbrown left a comment

rafiss left a comment

e-mbrown commented Oct 6, 2021

rafiss left a comment

e-mbrown left a comment

rafiss commented Oct 8, 2021 •

edited

Loading

knz commented Oct 11, 2021

knz left a comment

rafiss left a comment

rafiss left a comment

rafiss left a comment

e-mbrown commented Oct 11, 2021

craig bot commented Oct 11, 2021

craig bot commented Oct 12, 2021

sql: NFC-normalization on double-quoted identifiers #71140

sql: NFC-normalization on double-quoted identifiers #71140

Conversation

e-mbrown commented Oct 5, 2021

cockroach-teamcity commented Oct 5, 2021

rafiss left a comment

Choose a reason for hiding this comment

e-mbrown left a comment

Choose a reason for hiding this comment

rafiss left a comment

Choose a reason for hiding this comment

e-mbrown commented Oct 6, 2021

rafiss left a comment

Choose a reason for hiding this comment

e-mbrown left a comment

Choose a reason for hiding this comment

rafiss commented Oct 8, 2021 • edited Loading

knz commented Oct 11, 2021

knz left a comment

Choose a reason for hiding this comment

rafiss left a comment

Choose a reason for hiding this comment

rafiss left a comment

Choose a reason for hiding this comment

rafiss left a comment

Choose a reason for hiding this comment

e-mbrown commented Oct 11, 2021

craig bot commented Oct 11, 2021

craig bot commented Oct 12, 2021

rafiss commented Oct 8, 2021 •

edited

Loading