-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: NFC-normalization on double-quoted identifiers #71140
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown and @rafiss)
pkg/sql/normalization_test.go, line 39 at r1 (raw file):
t.Fatal(err) } defer db.Close()
nit: you can simplify the new test cluster and pgurl stuff with:
s, db, _ := serverutils.StartServer(t, base.TestServerArgs{Insecure: true})
defer s.Stopper().Stop(context.Background())
defer db.Close()
pkg/sql/normalization_test.go, line 41 at r1 (raw file):
defer db.Close() _, err = db.Exec("CREATE TABLE \"" + string("caf\u00E9") + "\" (a INT)")
nit: use
"CREATE TABLE \"caf\u00E9\" (a INT)"
pkg/sql/normalization_test.go, line 44 at r1 (raw file):
require.NoError(t, err) _, err = db.Exec("CREATE TABLE \"" + string("cafe\u0301") + "\" (a INT)")
nit: use
"CREATE TABLE \"cafe\u0301\" (a INT)"
pkg/sql/normalization_test.go, line 45 at r1 (raw file):
_, err = db.Exec("CREATE TABLE \"" + string("cafe\u0301") + "\" (a INT)") require.Errorf(t, err, "The tables should be considered duplicates when normalized")
let's also add a check for the error message:
require.True(t, strings.Contains(err.Error(), "already exists")
actually in addition, let's make the test create in this order:
_, err = db.Exec("CREATE TABLE café (a INT)")
_, err = db.Exec("CREATE TABLE cafe\u0301 (a INT)")
_, err = db.Exec("CREATE TABLE caf\u00E9 (a INT)")
_, err = db.Exec("CREATE TABLE \"caf\u00E9\" (a INT)")
_, err = db.Exec("CREATE TABLE \"cafe\u0301\" (a INT)")
the first one should work, and all the other ones should get the error.
pkg/sql/normalization_test.go, line 46 at r1 (raw file):
_, err = db.Exec("CREATE TABLE \"" + string("cafe\u0301") + "\" (a INT)") require.Errorf(t, err, "The tables should be considered duplicates when normalized") }
i think we should add another test here:
var b bool
err = db.QueryRow("SELECT 'caf\u00E9' = 'cafe\u0301'").Scan(&b)
require.NoError(t, err)
require.False(t, b)
this will make sure that normal strings are not normalized
pkg/sql/scanner/scan.go, line 920 at r1 (raw file):
} lval.SetStr(s.finishString(lexbase.NormalizeBytes(buf)))
i think this might be normalizing too many inputs -- it will apply to normal strings as well as identifiers. i think we need this kind of logic limited to only be in the identQuote case here
cockroach/pkg/sql/scanner/scan.go
Line 153 in f14a123
if s.scanString(lval, identQuote, false /* allowEscapes */, true /* requireUTF8 */) { |
also i noticed one other existing bug which i think we'll need to fix for this to work. this line should say if isLower && isASCII
:
cockroach/pkg/sql/scanner/scan.go
Line 566 in f14a123
if isLower { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)
pkg/sql/normalization_test.go, line 45 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
let's also add a check for the error message:
require.True(t, strings.Contains(err.Error(), "already exists")
actually in addition, let's make the test create in this order:
_, err = db.Exec("CREATE TABLE café (a INT)") _, err = db.Exec("CREATE TABLE cafe\u0301 (a INT)") _, err = db.Exec("CREATE TABLE caf\u00E9 (a INT)") _, err = db.Exec("CREATE TABLE \"caf\u00E9\" (a INT)") _, err = db.Exec("CREATE TABLE \"cafe\u0301\" (a INT)")
the first one should work, and all the other ones should get the error.
Done.
pkg/sql/normalization_test.go, line 46 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i think we should add another test here:
var b bool err = db.QueryRow("SELECT 'caf\u00E9' = 'cafe\u0301'").Scan(&b) require.NoError(t, err) require.False(t, b)
this will make sure that normal strings are not normalized
Done.
pkg/sql/scanner/scan.go, line 920 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i think this might be normalizing too many inputs -- it will apply to normal strings as well as identifiers. i think we need this kind of logic limited to only be in the identQuote case here
cockroach/pkg/sql/scanner/scan.go
Line 153 in f14a123
if s.scanString(lval, identQuote, false /* allowEscapes */, true /* requireUTF8 */) { also i noticed one other existing bug which i think we'll need to fix for this to work. this line should say
if isLower && isASCII
:cockroach/pkg/sql/scanner/scan.go
Line 566 in f14a123
if isLower {
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown and @rafiss)
pkg/sql/lexbase/normalize.go, line 57 at r2 (raw file):
} //NormalizeBytes normalizes to lowercase and Unicode Normalization
nit: this doesn't turn things into lower case. actually, maybe we should add a test for that too by adding to the test so it creates:
- table
"Café"
(with quotes) which should not fail - table Café (without quotes) which should fail
pkg/sql/lexbase/normalize.go, line 61 at r2 (raw file):
//identifiers. func NormalizeBytes(n []byte) []byte { return norm.NFC.Bytes(n)
do you know why we need NFC.bytes here? why not NFC.String, like the function above?
pkg/sql/lexbase/normalize.go, line 61 at r2 (raw file): Previously, rafiss (Rafi Shamim) wrote…
We use NFC.bytes here because in scanString(scan.go line 829), buf isn't converted into a string until the end of the function using s.finishString. A lot of the other functions use s.finishString so I thought targeting buf beforehand was a better option |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown and @rafiss)
pkg/sql/normalization_test.go, line 35 at r3 (raw file):
require.NoError(t, err) _, err = db.Exec(`CREATE TABLE "Café" (a INT)`)
super tiny nit: but could you move the "Café"
test so it's the last one? that way we can be sure that the Café
test is conflicting with the table we expect it to
pkg/sql/lexbase/normalize.go, line 61 at r2 (raw file):
Previously, e-mbrown wrote…
We use NFC.bytes here because in scanString(scan.go line 829), buf isn't converted into a string until the end of the function using s.finishString. A lot of the other functions use s.finishString so I thought targeting buf beforehand was a better option
ah it makes sense. i think we can make it use NFC.String though. that way we can add the optimization from the above function:
func NormalizeString(s string) string {
if isASCII(s) {
return s
}
return norm.NFC.String(s)
(we just don't want the convert to lower-case step)
then we change the call from scan.go so it's:
lval.SetStr(lexbase.NormalizeString(s.finishString(buf))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)
pkg/sql/normalization_test.go, line 35 at r3 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
super tiny nit: but could you move the
"Café"
test so it's the last one? that way we can be sure that theCafé
test is conflicting with the table we expect it to
Done.
pkg/sql/lexbase/normalize.go, line 61 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
ah it makes sense. i think we can make it use NFC.String though. that way we can add the optimization from the above function:
func NormalizeString(s string) string { if isASCII(s) { return s } return norm.NFC.String(s)
(we just don't want the convert to lower-case step)
then we change the call from scan.go so it's:
lval.SetStr(lexbase.NormalizeString(s.finishString(buf))
Done.
the failing test is
The two column names in that test have different unicode representations that normalize to the same thing, so now it fails with Am I correct to conclude that you no longer want that |
I'd say we need to change the text to expect a duplicate column definition (i.e. check that the normalization occurs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown and @rafiss)
pkg/sql/lexbase/normalize.go, line 57 at r4 (raw file):
} //NormalizeString normalizes to Unicode Normalization Form C (NFC).
nit: 1 space after //
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown, @knz, and @rafiss)
pkg/ccl/logictestccl/testdata/logic_test/case_sensitive_names, line 20 at r5 (raw file):
PARTITION p1 VALUES IN ((1)), PARTITION "P1" VALUES IN ((2)), PARTITION "Amélie" VALUES IN ((3))
i don't think we want to change this
but in pkg/sql/logictest/testdata/logic_test/case_sensitive_names we need to change these tests.
statement ok
CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)
# Check that the column names were encoded properly
query I
SELECT ordinal_position FROM information_schema.columns WHERE table_name = 'amelie' AND column_name::BYTES = b'Ame\xcc\x81lie'
----
1
query I
SELECT ordinal_position FROM information_schema.columns WHERE table_name = 'amelie' AND column_name::BYTES = b'Am\xc3\xa9lie'
----
2
# Check that the non-normalized names propagate throughout until results.
query II colnames
SELECT "Amélie", "Amélie" FROM Amelie
----
Amélie Amélie
2 1
the first one should become:
statement error duplicate column name: "Amélie"
CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)
and we can remove the other tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @e-mbrown, @knz, and @rafiss)
pkg/ccl/logictestccl/testdata/logic_test/case_sensitive_names, line 20 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i don't think we want to change this
but in pkg/sql/logictest/testdata/logic_test/case_sensitive_names we need to change these tests.
statement ok CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2) # Check that the column names were encoded properly query I SELECT ordinal_position FROM information_schema.columns WHERE table_name = 'amelie' AND column_name::BYTES = b'Ame\xcc\x81lie' ---- 1 query I SELECT ordinal_position FROM information_schema.columns WHERE table_name = 'amelie' AND column_name::BYTES = b'Am\xc3\xa9lie' ---- 2 # Check that the non-normalized names propagate throughout until results. query II colnames SELECT "Amélie", "Amélie" FROM Amelie ---- Amélie Amélie 2 1
the first one should become:
statement error duplicate column name: "Amélie" CREATE TABLE Amelie("Amélie" INT, "Amélie" INT); INSERT INTO Amelie VALUES (1, 2)
and we can remove the other tests.
oh my bad, you can keep the change in this file too. didn't see that it was a change in the expected result
There was a lack of NFC-normalization on double-quoted identifiers. This allowed for ambiguous table/db/column names. This change adds normalization and prevents this case. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm! feel free to merge once CI passes
bors r=rafiss |
Build failed (retrying...): |
Build succeeded: |
Resolves: #55396
There was a lack of NFC-normalization on double-quoted
identifiers. This allowed for ambiguous table/db/column names.
This change adds normalization and prevents this case.
Release note: None