sql: Unify the SQL type system #36598

andy-kimball · 2019-04-07T13:18:41Z

This PR unifies the three different SQL type systems in CRDB:

sqlbase.ColumnType - protobuf representation of table column types
coltypes.T - in-memory representation of table column types
types.T - computation types

The type systems are quite similar, but also subtly different. This causes bugs on a regular basis throughout the codebase. Each type system supports a different subset of the PG type functionality. For example, sqlbase.ColumnType stores column widths. types.T does not store widths, but it can store a Postgres OID value, which sqlbase.ColumnType is unable to do.

In addition, even for supported functionality, the API's are often tricky and error-prone. For example, types.T values can sometimes be compared using ==, but other times Equivalent must be called, and it is up to the user to know when to use which. As another example, support for OID values require wrapping types.T instances in a special TOIDWrapper type. Our developers are expected to know when it needs to be done, but typically they either don't understand when it's needed, or else forget to do it. It also breaks == comparison semantics, leading to subtle bugs that only manifest in edge cases with unusual OID values.

Another big problem that comes up repeatedly are lossy conversions. Throughout the codebase, there are frequent conversions to/from the various type systems. Many conversions discard important type information, which often leads to incompatibilities with Postgres, as well as providing a breeding ground for bugs.

This PR unifies the three type systems into just one. The unified type is called types.T (in new sql/types package). It uses a protobuf-generated struct that is backwards-compatible with sqlbase.ColumnType so that it can be embedded in place of ColumnType in larger protobuf messages (like sqlbase.ColumnDescriptor). The API is locked down so that it's difficult to ignorantly or accidentally create invalid types. The unified type supports everything that the three previous type systems did, and is more compatible with PG as a result (with more work, even more type compatibility is possible).

Resolves #32639

cockroach-teamcity · 2019-04-07T13:19:04Z

This change is

andy-kimball · 2019-04-07T13:23:17Z

Note to reviewers:
I know this is a real monster of a PR. However, I did it in stages, where each commit is a standalone step that builds and passes tests. This also makes it easier to rebase.

It might be easiest to review the final sql/types package first, to see where it all ends up. Then, you might review commit by commit to see each step towards the final product.

RaduBerinde · 2019-04-08T01:55:09Z

This is fantastic. Thank you for doing this!

andy-kimball · 2019-04-08T15:29:19Z

Anyone have suggestions about resolving a golint error:

pkg/sql/types/types.go:286:2: var Uuid should be UUID

I'm already using UUID to signify the name of the SemanticType. All other types follow this convention of types.INT (SemanticType) vs. types.Int (pre-constructed instance). I can't find a way to disable lint for this one line, either.

andy-kimball

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @BramGruneir, @jordanlewis, and @knz)

pkg/sql/pgwire/types.go, line 73 at r15 (raw file):

// mapResultOid performs any necessary mapping of Oid values to be returned to
// the client. For now, this only includes mapping T_float4 => T_float8.
func mapResultOid(o oid.Oid) oid.Oid {

@knz, I wasn't sure what to do about this issue (I believe it's a bug in libpq). I ended up mapping from float4 => float8 to work around it. But it raises larger issues about whether it's a good idea to begin returning the true OID values for selected columns. Should we map all OIDs to "canonical" OIDs (e.g. int2 => int8, varchar => string, etc) in order to match previous version behavior?

P.S. The PG client does not exhibit this bad float4 behavior when pointed at CRDB. It's a libpq thing.

andy-kimball · 2019-04-08T15:57:23Z

Another suggestion for reviewers: compare all changes made to test files (e.g. information_schema and pg_catalog), since that will tend to show differences in behavior as a result of this PR.

RaduBerinde · 2019-04-08T15:57:46Z

Anyone have suggestions about resolving a golint error:

Just add an exception here: https://github.com/cockroachdb/cockroach/blob/master/pkg/testutils/lint/lint_test.go#L1451
Something like
stream.GrepNot("pkg/sql/types/types.go.* var Uuid should be UUID")
so it's only for this file.

andy-kimball

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @BramGruneir, @jordanlewis, and @knz)

pkg/sql/logictest/testdata/logic_test/array, line 212 at r15 (raw file):

{1,2,3}

query T

This new behavior matches PG.

andy-kimball

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @BramGruneir, @jordanlewis, and @knz)

pkg/sql/parser/parse.go, line 374 at r17 (raw file):

}

// The SERIAL types are pseudo-types that are only used during parsing. After

@knz, I wanted to call this out. I'm treating SERIAL differently now by not creating a "real" type for it. Instead, it's just different instances of the INT types that I then use == reference comparison to recognize. All "memory" of these types is gone after the analysis phase is complete; they exist only for pretty-printing the parse tree.

andy-kimball

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @BramGruneir, @jordanlewis, and @knz)

pkg/sql/pgwire/types.go, line 73 at r15 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

@knz, I wasn't sure what to do about this issue (I believe it's a bug in libpq). I ended up mapping from float4 => float8 to work around it. But it raises larger issues about whether it's a good idea to begin returning the true OID values for selected columns. Should we map all OIDs to "canonical" OIDs (e.g. int2 => int8, varchar => string, etc) in order to match previous version behavior?

P.S. The PG client does not exhibit this bad float4 behavior when pointed at CRDB. It's a libpq thing.

As another example, not mapping to previous OIDs results in this Java acceptance test failing:

    public void testArrayWithProps() throws Exception {
        PreparedStatement stmt = conn.prepareStatement("CREATE TABLE x (a SMALLINT[])");
        stmt.execute();
        stmt = conn.prepareStatement("INSERT INTO x VALUES (ARRAY[123])");
        stmt.execute();
        stmt = conn.prepareStatement("SELECT a FROM x");
        ResultSet rs = stmt.executeQuery();
        rs.next();

        Array ar = rs.getArray(1);
        Long[] fs = (Long[]) ar.getArray();
        Assert.assertArrayEquals(new Long[]{123L}, fs);
    }

with the error: incompatible types: long cannot be converted to java.lang.Integer
since oid.T__int2 is returned rather than oid.T__int8, as before.

ColumnType is a rather big struct, and it should almost never be passed by value. Update most usages to pass by ref rather than by value. Release note: None

Convert all uses of the existing types.T interface and structs to instead use an alias type: type T = ColumnType In a future commit, ColumnType will be converted to T so that the alias is no longer necessary. This change requires all current usages of types.T to be converted to *types.T, since types.T is now a struct instead of an interface. Release note: None

Update the String() method on types.T to show a type name that is closer to what coltypes.T shows. For example, varchar and bit are currently never returned by types.T, which instead maps them to string and varbit, respectively. This commit prepares for unification of types.T and coltypes.T. Release note: None

Directly create types.T instances in the SQL parser and other places where coltypes.T is in use (e.g. cast, annotate exprs). Doing this requires: 1. Support for SERIAL pseudo-types to types.T 2. Reconciliation of type name string algorithms This commit also removes the coltypes package, since it is no longer in use. In addition, this commit reworks some of the random type generation needed to get various tests fully working after these typing changes. Release note: None

It's confusing that the type is called NULL, even though PG calls it UNKNOWN and it prints as UNKNOWN. Change it to be consistent. Release note: None

Remove the "type T = ColumnType" alias by renaming ColumnType to T. Now there is just one way to refer to SQL types in the system: types.T, which should reduce confusion. Also delete VisibleType, since it's no longer used, except for mapping to/from the legacy ColumnType representation. Release note: None

Do not export fields from InternalType as fields on types.T. Instead, corresponding methods. This enables tighter control over how the fields are set. In particular, the Unmarshal method and the pre-constructed types can now guarantee that the Oid, Width, and Locale fields are properly set. This better encapsulation makes it more difficult for callers to accidentally construct invalid types, which is a real concern, given all the subtle rules and usages of various Type objects across multiple releases. This change means that every types.T constructor now needs to call a constructor method or use a pre-constructed type. Release note: None

Put constants and methods in a more intuitive order, for easier comprehension. Only code is moved; this commit makes no functional changes, and so does not need to be carefully reviewed. Release note: None

Rearrange and clean up methods in the package, including moving some functionality from the package to other locations (e.g. information schema related methods). Add comments to all exported structs and methods in the types package. Add tests for all functionality in the package. Fix compatibility issue in pg_type table, where oidvector and int2vector had array oid types of 0. Release note (sql change): Display correct array type oids for oidvector and int2vector types in pg_type table, for better PG compatibility.

Now that the type system is unified, put the new types package in a more discoverable, intuitive location. Release note: None

Update terminology from SemanticType to Family, per conversation on PR cockroachdb#36699. The rename follows this pattern: INT -> IntFamily STRING -> StringFamily ... In addition, update all the places that were using the name of the family as if it were the name of the type, and update them to use one of the other type name methods instead. Add a PGName() method to round out the set of type name methods that we use in various places in the code. Release note: None

When a SQL projection column has no alias, a name is generated for it. If the expression is a CAST expression, the name is derived from the data type of the cast. This commit uses the PG-compatible name rather than the CRDB name for the data type, which slightly increases PG-compat. Release note (sql change): Use the PG-compatible name when auto-generating a column alias for a CAST expression.

Currently, locale strings are quoted when they contain underscore or dash characters. The Unicode CLDR standard treats those characters as equivalent. Furthermore, SQL identifiers don't need to be quoted. So this commit introduces a new EncodeLocaleName method that maps dash characters to underscore characters so that locale names do not need to be quoted when printed as part of SQL syntax. This change resolves a compatibility issue with the TypeORM driver, which does not handle quoted locale names. Release note (sql change): Locale names in COLLATE clauses have dash characters mapped to underscore when printed as part of SQL syntax.

andy-kimball · 2019-04-19T20:51:20Z

bors r+

36598: sql: Unify the SQL type system r=andy-kimball a=andy-kimball This PR unifies the three different SQL type systems in CRDB: `sqlbase.ColumnType` - protobuf representation of table column types `coltypes.T` - in-memory representation of table column types `types.T` - computation types The type systems are quite similar, but also subtly different. This causes bugs on a regular basis throughout the codebase. Each type system supports a different subset of the PG type functionality. For example, `sqlbase.ColumnType` stores column widths. `types.T` does not store widths, but it can store a Postgres OID value, which `sqlbase.ColumnType` is unable to do. In addition, even for supported functionality, the API's are often tricky and error-prone. For example, `types.T` values can sometimes be compared using `==`, but other times `Equivalent` must be called, and it is up to the user to know when to use which. As another example, support for OID values require wrapping `types.T` instances in a special `TOIDWrapper` type. Our developers are expected to know when it needs to be done, but typically they either don't understand when it's needed, or else forget to do it. It also breaks `==` comparison semantics, leading to subtle bugs that only manifest in edge cases with unusual OID values. Another big problem that comes up repeatedly are lossy conversions. Throughout the codebase, there are frequent conversions to/from the various type systems. Many conversions discard important type information, which often leads to incompatibilities with Postgres, as well as providing a breeding ground for bugs. This PR unifies the three type systems into just one. The unified type is called `types.T` (in new `sql/types` package). It uses a protobuf-generated struct that is backwards-compatible with `sqlbase.ColumnType` so that it can be embedded in place of `ColumnType` in larger protobuf messages (like `sqlbase.ColumnDescriptor`). The API is locked down so that it's difficult to ignorantly or accidentally create invalid types. The unified type supports everything that the three previous type systems did, and is more compatible with PG as a result (with more work, even more type compatibility is possible). Resolves #32639 36681: distsqlplan: reset MergeOrdering if windower is planned r=yuzefovich a=yuzefovich Windower doesn't guarantee maintaining the order, so merge ordering of the plan should be set to empty. Fixes: #35179. Release note: None 36962: storage/engine: fix handling of 0-length varstrings in RocksDBBatchReader r=tbg a=petermattis Previously, 0-length varstrings inadvertently cuased the reader to truncate the batch repr which would usually result in a call to `Next` complaining about the batch containing an unexpected number of entries. So far, it looks like the only effect of this would be "invalid batch" errors while using `cockroach debug` commands. It is possible there is a more serious effect, though. See #36937 Release note (bug fix): Fixed a bug in write batch decoding that could cause "invalid batch" errors while using `cockroach debug` commands to analyze data. Co-authored-by: Andrew Kimball <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Peter Mattis <[email protected]>

craig · 2019-04-19T21:26:50Z

Build succeeded

GitHub CI (Cockroach)

With the updated unified types from cockroachdb#36598, a number of tests are now passing. Each test has already been vetted. Fixes cockroachdb#36971 Release note: None

37002: roachtest: update hibernate blacklist v19.2 r=BramGruneir a=BramGruneir With the updated unified types from #36598, a number of tests are now passing. Each test has already been vetted. Fixes #36971 Release note: None Co-authored-by: Bram Gruneir <[email protected]>

andy-kimball requested review from jordanlewis, knz and bobvawter April 7, 2019 13:18

andy-kimball requested a review from a team as a code owner April 7, 2019 13:18

andy-kimball requested review from a team April 7, 2019 13:18

andy-kimball requested a review from a team as a code owner April 7, 2019 13:18

andy-kimball requested review from a team April 7, 2019 13:18

andy-kimball requested review from BramGruneir and removed request for a team April 8, 2019 04:59

andy-kimball commented Apr 8, 2019

View reviewed changes

andy-kimball force-pushed the types branch 2 times, most recently from c97cd19 to cc215c0 Compare April 8, 2019 16:53

andy-kimball commented Apr 8, 2019

View reviewed changes

andy-kimball force-pushed the types branch from cc215c0 to 12f56f7 Compare April 8, 2019 17:24

andy-kimball added 6 commits April 19, 2019 06:37

sql: Fix cases where ColumnType is unnecessarily passed by value

03a9de0

ColumnType is a rather big struct, and it should almost never be passed by value. Update most usages to pass by ref rather than by value. Release note: None

sql: Rename SemanticType NULL to UNKNOWN

f1f0806

It's confusing that the type is called NULL, even though PG calls it UNKNOWN and it prints as UNKNOWN. Change it to be consistent. Release note: None

andy-kimball mentioned this pull request Apr 19, 2019

sql: Rename SemanticType to Family (patch 16/) #36957

Closed

andy-kimball added 7 commits April 19, 2019 12:12

sql: Reorder constants and methods in types.go

9e9caa4

Put constants and methods in a more intuitive order, for easier comprehension. Only code is moved; this commit makes no functional changes, and so does not need to be carefully reviewed. Release note: None

sql: Move sql/sem/types package to sql/types

8e16dda

Now that the type system is unified, put the new types package in a more discoverable, intuitive location. Release note: None

andy-kimball force-pushed the types branch from 12f56f7 to 7365a07 Compare April 19, 2019 20:11

craig bot merged commit 7365a07 into cockroachdb:master Apr 19, 2019

andy-kimball deleted the types branch April 19, 2019 21:48

BramGruneir mentioned this pull request Apr 22, 2019

roachtest: update hibernate blacklist v19.2 #37002

Merged

rafiss mentioned this pull request May 28, 2019

release-19.1: tree: handle null in suboperator expressions and improve empty array cast #37897

Closed

This was referenced Jun 6, 2019

sql: Unify the SQL type system cockroachdb/docs#4883

Closed

sql: Unify the SQL type system cockroachdb/docs#4884

Closed

rafiss mentioned this pull request Jul 27, 2020

Regression starting from 19.2 with Elixir / binary protocol when loading arrays #51360

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: Unify the SQL type system #36598

sql: Unify the SQL type system #36598

andy-kimball commented Apr 7, 2019

cockroach-teamcity commented Apr 7, 2019

andy-kimball commented Apr 7, 2019

RaduBerinde commented Apr 8, 2019

andy-kimball commented Apr 8, 2019

andy-kimball left a comment

andy-kimball commented Apr 8, 2019

RaduBerinde commented Apr 8, 2019

andy-kimball left a comment

andy-kimball left a comment

andy-kimball left a comment

andy-kimball commented Apr 19, 2019

craig bot commented Apr 19, 2019

sql: Unify the SQL type system #36598

sql: Unify the SQL type system #36598

Conversation

andy-kimball commented Apr 7, 2019

cockroach-teamcity commented Apr 7, 2019

andy-kimball commented Apr 7, 2019

RaduBerinde commented Apr 8, 2019

andy-kimball commented Apr 8, 2019

andy-kimball left a comment

Choose a reason for hiding this comment

andy-kimball commented Apr 8, 2019

RaduBerinde commented Apr 8, 2019

andy-kimball left a comment

Choose a reason for hiding this comment

andy-kimball left a comment

Choose a reason for hiding this comment

andy-kimball left a comment

Choose a reason for hiding this comment

andy-kimball commented Apr 19, 2019

craig bot commented Apr 19, 2019

Build succeeded