Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: Unify the SQL type system #36598

Merged
merged 18 commits into from
Apr 19, 2019
Merged

sql: Unify the SQL type system #36598

merged 18 commits into from
Apr 19, 2019

Conversation

andy-kimball
Copy link
Contributor

This PR unifies the three different SQL type systems in CRDB:

sqlbase.ColumnType - protobuf representation of table column types
coltypes.T - in-memory representation of table column types
types.T - computation types

The type systems are quite similar, but also subtly different. This causes bugs on a regular basis throughout the codebase. Each type system supports a different subset of the PG type functionality. For example, sqlbase.ColumnType stores column widths. types.T does not store widths, but it can store a Postgres OID value, which sqlbase.ColumnType is unable to do.

In addition, even for supported functionality, the API's are often tricky and error-prone. For example, types.T values can sometimes be compared using ==, but other times Equivalent must be called, and it is up to the user to know when to use which. As another example, support for OID values require wrapping types.T instances in a special TOIDWrapper type. Our developers are expected to know when it needs to be done, but typically they either don't understand when it's needed, or else forget to do it. It also breaks == comparison semantics, leading to subtle bugs that only manifest in edge cases with unusual OID values.

Another big problem that comes up repeatedly are lossy conversions. Throughout the codebase, there are frequent conversions to/from the various type systems. Many conversions discard important type information, which often leads to incompatibilities with Postgres, as well as providing a breeding ground for bugs.

This PR unifies the three type systems into just one. The unified type is called types.T (in new sql/types package). It uses a protobuf-generated struct that is backwards-compatible with sqlbase.ColumnType so that it can be embedded in place of ColumnType in larger protobuf messages (like sqlbase.ColumnDescriptor). The API is locked down so that it's difficult to ignorantly or accidentally create invalid types. The unified type supports everything that the three previous type systems did, and is more compatible with PG as a result (with more work, even more type compatibility is possible).

Resolves #32639

@andy-kimball andy-kimball requested a review from a team as a code owner April 7, 2019 13:18
@andy-kimball andy-kimball requested review from a team April 7, 2019 13:18
@andy-kimball andy-kimball requested a review from a team as a code owner April 7, 2019 13:18
@andy-kimball andy-kimball requested review from a team April 7, 2019 13:18
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andy-kimball
Copy link
Contributor Author

Note to reviewers:
I know this is a real monster of a PR. However, I did it in stages, where each commit is a standalone step that builds and passes tests. This also makes it easier to rebase.

It might be easiest to review the final sql/types package first, to see where it all ends up. Then, you might review commit by commit to see each step towards the final product.

@RaduBerinde
Copy link
Member

This is fantastic. Thank you for doing this!

@andy-kimball andy-kimball requested review from BramGruneir and removed request for a team April 8, 2019 04:59
@andy-kimball
Copy link
Contributor Author

Anyone have suggestions about resolving a golint error:

pkg/sql/types/types.go:286:2: var Uuid should be UUID

I'm already using UUID to signify the name of the SemanticType. All other types follow this convention of types.INT (SemanticType) vs. types.Int (pre-constructed instance). I can't find a way to disable lint for this one line, either.

Copy link
Contributor Author

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @BramGruneir, @jordanlewis, and @knz)


pkg/sql/pgwire/types.go, line 73 at r15 (raw file):

// mapResultOid performs any necessary mapping of Oid values to be returned to
// the client. For now, this only includes mapping T_float4 => T_float8.
func mapResultOid(o oid.Oid) oid.Oid {

@knz, I wasn't sure what to do about this issue (I believe it's a bug in libpq). I ended up mapping from float4 => float8 to work around it. But it raises larger issues about whether it's a good idea to begin returning the true OID values for selected columns. Should we map all OIDs to "canonical" OIDs (e.g. int2 => int8, varchar => string, etc) in order to match previous version behavior?

P.S. The PG client does not exhibit this bad float4 behavior when pointed at CRDB. It's a libpq thing.

@andy-kimball
Copy link
Contributor Author

Another suggestion for reviewers: compare all changes made to test files (e.g. information_schema and pg_catalog), since that will tend to show differences in behavior as a result of this PR.

@RaduBerinde
Copy link
Member

Anyone have suggestions about resolving a golint error:

Just add an exception here: https://github.com/cockroachdb/cockroach/blob/master/pkg/testutils/lint/lint_test.go#L1451
Something like
stream.GrepNot("pkg/sql/types/types.go.* var Uuid should be UUID")
so it's only for this file.

Copy link
Contributor Author

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @BramGruneir, @jordanlewis, and @knz)


pkg/sql/logictest/testdata/logic_test/array, line 212 at r15 (raw file):

{1,2,3}

query T

This new behavior matches PG.

@andy-kimball andy-kimball force-pushed the types branch 2 times, most recently from c97cd19 to cc215c0 Compare April 8, 2019 16:53
Copy link
Contributor Author

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @BramGruneir, @jordanlewis, and @knz)


pkg/sql/parser/parse.go, line 374 at r17 (raw file):

}

// The SERIAL types are pseudo-types that are only used during parsing. After

@knz, I wanted to call this out. I'm treating SERIAL differently now by not creating a "real" type for it. Instead, it's just different instances of the INT types that I then use == reference comparison to recognize. All "memory" of these types is gone after the analysis phase is complete; they exist only for pretty-printing the parse tree.

Copy link
Contributor Author

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @BramGruneir, @jordanlewis, and @knz)


pkg/sql/pgwire/types.go, line 73 at r15 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

@knz, I wasn't sure what to do about this issue (I believe it's a bug in libpq). I ended up mapping from float4 => float8 to work around it. But it raises larger issues about whether it's a good idea to begin returning the true OID values for selected columns. Should we map all OIDs to "canonical" OIDs (e.g. int2 => int8, varchar => string, etc) in order to match previous version behavior?

P.S. The PG client does not exhibit this bad float4 behavior when pointed at CRDB. It's a libpq thing.

As another example, not mapping to previous OIDs results in this Java acceptance test failing:

    public void testArrayWithProps() throws Exception {
        PreparedStatement stmt = conn.prepareStatement("CREATE TABLE x (a SMALLINT[])");
        stmt.execute();
        stmt = conn.prepareStatement("INSERT INTO x VALUES (ARRAY[123])");
        stmt.execute();
        stmt = conn.prepareStatement("SELECT a FROM x");
        ResultSet rs = stmt.executeQuery();
        rs.next();

        Array ar = rs.getArray(1);
        Long[] fs = (Long[]) ar.getArray();
        Assert.assertArrayEquals(new Long[]{123L}, fs);
    }

with the error: incompatible types: long cannot be converted to java.lang.Integer
since oid.T__int2 is returned rather than oid.T__int8, as before.

ColumnType is a rather big struct, and it should almost never be passed by
value. Update most usages to pass by ref rather than by value.

Release note: None
Convert all uses of the existing types.T interface and structs to instead
use an alias type:

  type T = ColumnType

In a future commit, ColumnType will be converted to T so that the alias is
no longer necessary.

This change requires all current usages of types.T to be converted to *types.T,
since types.T is now a struct instead of an interface.

Release note: None
Update the String() method on types.T to show a type name that is closer to what
coltypes.T shows. For example, varchar and bit are currently never returned by
types.T, which instead maps them to string and varbit, respectively.

This commit prepares for unification of types.T and coltypes.T.

Release note: None
Directly create types.T instances in the SQL parser and other places where
coltypes.T is in use (e.g. cast, annotate exprs). Doing this requires:

  1. Support for SERIAL pseudo-types to types.T
  2. Reconciliation of type name string algorithms

This commit also removes the coltypes package, since it is no longer in use.

In addition, this commit reworks some of the random type generation needed to
get various tests fully working after these typing changes.

Release note: None
It's confusing that the type is called NULL, even though PG calls it
UNKNOWN and it prints as UNKNOWN. Change it to be consistent.

Release note: None
Remove the "type T = ColumnType" alias by renaming ColumnType to T. Now
there is just one way to refer to SQL types in the system: types.T, which
should reduce confusion.

Also delete VisibleType, since it's no longer used, except for mapping
to/from the legacy ColumnType representation.

Release note: None
Do not export fields from InternalType as fields on types.T. Instead,
corresponding methods. This enables tighter control over how the fields
are set. In particular, the Unmarshal method and the pre-constructed types
can now guarantee that the Oid, Width, and Locale fields are properly
set. This better encapsulation makes it more difficult for callers to
accidentally construct invalid types, which is a real concern, given all
the subtle rules and usages of various Type objects across multiple
releases.

This change means that every types.T constructor now needs to call a
constructor method or use a pre-constructed type.

Release note: None
Put constants and methods in a more intuitive order, for easier
comprehension.

Only code is moved; this commit makes no functional changes, and so
does not need to be carefully reviewed.

Release note: None
Rearrange and clean up methods in the package, including moving some
functionality from the package to other locations (e.g. information
schema related methods). Add comments to all exported structs and methods
in the types package. Add tests for all functionality in the package.

Fix compatibility issue in pg_type table, where oidvector and int2vector
had array oid types of 0.

Release note (sql change): Display correct array type oids for oidvector
and int2vector types in pg_type table, for better PG compatibility.
Now that the type system is unified, put the new types package in a
more discoverable, intuitive location.

Release note: None
Update terminology from SemanticType to Family, per conversation on
PR cockroachdb#36699. The rename follows this pattern:

  INT -> IntFamily
  STRING -> StringFamily
  ...

In addition, update all the places that were using the name of the family
as if it were the name of the type, and update them to use one of the
other type name methods instead. Add a PGName() method to round out the
set of type name methods that we use in various places in the code.

Release note: None
When a SQL projection column has no alias, a name is generated for it. If
the expression is a CAST expression, the name is derived from the data type
of the cast. This commit uses the PG-compatible name rather than the CRDB
name for the data type, which slightly increases PG-compat.

Release note (sql change): Use the PG-compatible name when auto-generating a
column alias for a CAST expression.
Currently, locale strings are quoted when they contain underscore or
dash characters. The Unicode CLDR standard treats those characters as
equivalent. Furthermore, SQL identifiers don't need to be quoted. So
this commit introduces a new EncodeLocaleName method that maps dash
characters to underscore characters so that locale names do not need
to be quoted when printed as part of SQL syntax.

This change resolves a compatibility issue with the TypeORM driver,
which does not handle quoted locale names.

Release note (sql change): Locale names in COLLATE clauses have dash
characters mapped to underscore when printed as part of SQL syntax.
@andy-kimball
Copy link
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request Apr 19, 2019
36598: sql: Unify the SQL type system r=andy-kimball a=andy-kimball

This PR unifies the three different SQL type systems in CRDB:

  `sqlbase.ColumnType` - protobuf representation of table column types
  `coltypes.T` - in-memory representation of table column types
  `types.T` - computation types

The type systems are quite similar, but also subtly different. This causes bugs on a regular basis throughout the codebase. Each type system supports a different subset of the PG type functionality. For example, `sqlbase.ColumnType` stores column widths. `types.T` does not store widths, but it can store a Postgres OID value, which `sqlbase.ColumnType` is unable to do.

In addition, even for supported functionality, the API's are often tricky and error-prone. For example, `types.T` values can sometimes be compared using `==`, but other times `Equivalent` must be called, and it is up to the user to know when to use which. As another example, support for OID values require wrapping `types.T` instances in a special `TOIDWrapper` type. Our developers are expected to know when it needs to be done, but typically they either don't understand when it's needed, or else forget to do it. It also breaks `==` comparison semantics, leading to subtle bugs that only manifest in edge cases with unusual OID values.

Another big problem that comes up repeatedly are lossy conversions. Throughout the codebase, there are frequent conversions to/from the various type systems. Many conversions discard important type information, which often leads to incompatibilities with Postgres, as well as providing a breeding ground for bugs.

This PR unifies the three type systems into just one. The unified type is called `types.T` (in new `sql/types` package). It uses a protobuf-generated struct that is backwards-compatible with `sqlbase.ColumnType` so that it can be embedded in place of `ColumnType` in larger protobuf messages (like `sqlbase.ColumnDescriptor`). The API is locked down so that it's difficult to ignorantly or accidentally create invalid types. The unified type supports everything that the three previous type systems did, and is more compatible with PG as a result (with more work, even more type compatibility is possible).

Resolves #32639

36681: distsqlplan: reset MergeOrdering if windower is planned r=yuzefovich a=yuzefovich

Windower doesn't guarantee maintaining the order, so merge ordering
of the plan should be set to empty.

Fixes: #35179.

Release note: None

36962: storage/engine: fix handling of 0-length varstrings in RocksDBBatchReader r=tbg a=petermattis

Previously, 0-length varstrings inadvertently cuased the reader to
truncate the batch repr which would usually result in a call to `Next`
complaining about the batch containing an unexpected number of
entries. So far, it looks like the only effect of this would be "invalid
batch" errors while using `cockroach debug` commands. It is possible
there is a more serious effect, though.

See #36937

Release note (bug fix): Fixed a bug in write batch decoding that could
cause "invalid batch" errors while using `cockroach debug` commands to
analyze data.

Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Peter Mattis <[email protected]>
@craig
Copy link
Contributor

craig bot commented Apr 19, 2019

Build succeeded

@craig craig bot merged commit 7365a07 into cockroachdb:master Apr 19, 2019
@andy-kimball andy-kimball deleted the types branch April 19, 2019 21:48
BramGruneir added a commit to BramGruneir/cockroach that referenced this pull request Apr 23, 2019
With the updated unified types from cockroachdb#36598, a number of tests are now passing.
Each test has already been vetted.

Fixes cockroachdb#36971

Release note: None
craig bot pushed a commit that referenced this pull request Apr 23, 2019
37002: roachtest: update hibernate blacklist v19.2 r=BramGruneir a=BramGruneir

With the updated unified types from #36598, a number of tests are now passing.
Each test has already been vetted.

Fixes #36971

Release note: None

Co-authored-by: Bram Gruneir <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: Grand Unified Type System
4 participants