-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
importccl: support unique_rowid()
as default expression for IMPORT INTO
#50922
Conversation
114a25d
to
4e7f8b4
Compare
unique_rowid()
as default expressionunique_rowid()
as default expression for IMPORT INTO
pkg/sql/row/row_converter.go
Outdated
if err != nil { | ||
return errors.Wrapf(err, "error evaluating default expression for IMPORT INTO") | ||
if col.DefaultExprStr() == "unique_rowid()" { | ||
if c.hidden < 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is quite right? Not all tables into which we'll use IMPORT INTO
with unique_rowid()
will also be using hidding PK cols, right?
also, I think this would currently be putting the same value in the hidden pk col and in the explicit default col, but that seems wrong -- each call to unique_rowid is supposed to return a different value, even if they're in the same row. Hmm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for catching that. Turns out I omitted the case where Primary Key
is specified in the column specification.
For the second part of your comment, this would be a problem when we have more than a column with default expression being unique_rowid()
right? Would it make sense to add a constraint that we only support if there's <= 1 column with unique_rowid
?
Edit: just realized that it might be possible to get around by changing (uint64(sourceID) << rowIDBits) ^ uint64(rowIndex)
to (uint64(sourceID) << rowIDBits) ^ uint64(c * rowIndex + r)
with c
being the number of columns with unique_rowid()
(possibly including the hidden column) and r=0, 1, ..., c-1
being the actual column positions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, I thought I left a comment here... but I guess GitHub swallowed it?
Changing the method to (uint64(sourceID) << rowIDBits) ^ uint64(c * rowIndex + r)
would work, but since the rows in a CSV import also add the timestamp to their rowID, it would also increase the rate at which we consume time. Right now, every row consumes 10 microseconds of time - with the modified expression, it would consume c*10 microseconds. Extrapolating from the comment in the code:
100m row file would use 10µs/row or ~17min worth of IDs
Then importing a (10 column x 100m) row file would use nearly 3 hours of IDs. That seems quite expensive... Perhaps we can keep track of only the columns which have a default value of unique_rowid
. This would reduce the rate at which we consume keys to u*10 microseconds/row, where u
is the number of columns with a uniquie_rowid
default value.
But perhaps limiting 1 default unique_rowid
per import is reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this concern makes sense. Maybe roll back to supporting only 1 default column with unique_rowid
until we have a better scheme for avoiding collision?
Edit: the updated implementation has the value c
being the number of columns with unique_rowid()
, including the hidden column but not visible and targeted columns (these columns should use the values from the CSVs instead). As @dt mentioned (below) this should work for wide columns that has (at most) two unique_rowid()
columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd have a slight preference for counting how many we need as previously suggested, instead of total columns, since a very wide table is pretty common but a wide table of uniquely-generated-otherwise-meaningless integers is not. In practice, I'd guess it is one most often (pk/hidden pk) but wouldn't be surprised by two: implicit pk and then explicit unique rowid (where they probably should have used that as an explicit PK instead but didn't)
unique_rowid()
as default expression for IMPORT INTOunique_rowid()
as default expression for IMPORT INTO
unique_rowid()
as default expression for IMPORT INTOunique_rowid()
as default expression for IMPORT INTO
unique_rowid()
as default expression for IMPORT INTOunique_rowid()
as default expression for IMPORT INTO
7447d41
to
e215947
Compare
Added TODO to address the potential problem with IDs with multiple |
Can you remove the SQL comments from https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/importccl/testdata/pgdump/geo.sql too? |
(Add on to the problem below: it seems like the target columns are not properly populated during the Just found a problem: right now the implementation of import for pgdump seemed to assume that the statements in E.g. the test file has Consider, now, changing the column declaration to (basically change the last two) |
@mjibson I'm on my way in #51065 to address the problem of |
Do whatever makes sense to you. If separate PRs is easier, do that. |
Just did that after closing #51065 (along with modification on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Anzoteh96, @miretskiy, and @pbardea)
pkg/ccl/importccl/import_stmt_test.go, line 68 at r2 (raw file):
// This checks that the selected columns of a query string have // all unique elements. It's useful for checking unique_rowid. func checkUnique(allStr [][]string, inds []int) bool {
Can we rename the first argument to something like rows
or data
and the second to colIndexes
. It's not immediately clear to me what these are supposed to be without looking at the callsite.
pkg/ccl/importccl/import_stmt_test.go, line 85 at r3 (raw file):
// This checks that the selected columns of a query string have // no "NULL" elements. It's useful for checking unique_rowid. func checkNoNull(allStr [][]string, inds []int) bool {
nit: I think containsNoNull
may be a better name for this since it doesn't do any actual checking.
pkg/sql/row/row_converter.go, line 201 at r3 (raw file):
// The rest of these are derived from tableDesc, just cached here. rowIDs []int
This name seems a bit too generic for what it really is. Perhaps defaultUniqueCols
would work?
pkg/sql/row/row_converter.go, line 346 at r3 (raw file):
// if necessary. func (c *DatumRowConverter) Row(ctx context.Context, sourceID int32, rowIndex int64) error { for i, pos := range c.rowIDs {
I'd prefer pos
to be renamed to something like colIdx
.
pkg/sql/row/row_converter.go, line 347 at r3 (raw file):
func (c *DatumRowConverter) Row(ctx context.Context, sourceID int32, rowIndex int64) error { for i, pos := range c.rowIDs { // We don't want to call unique_rowid() for columns with such default expressions because it
nit: I think that this comment needs wrapping again after updating. The lower bits here aren't strictly the row IDs since they now also depend on the number of non-targeted rows with a unique_rowid
default val.
pkg/sql/row/row_converter.go, line 352 at r3 (raw file):
// low-bits. Instead, make our own IDs that attempt to keep each generator // (sourceID) writing to its own key-space with sequential rowIndexes // mapping to sequential unique IDs, by putting the rowID in the lower
This comment probably also needs updates, since the lower bits are no longer strictly just the
pkg/sql/row/row_converter.go, line 378 at r3 (raw file):
// too close to each other. It will therefore be nice to fix this problem. avoidCollisionsWithSQLsIDs := uint64(1 << 63) shiftedRowIndex := int64(len(c.rowIDs))*rowIndex + int64(i)
A comment explaining this change would be useful.
pkg/sql/row/row_converter.go, line 390 at r3 (raw file):
col := &c.cols[i] if !isTargetCol(i) && !col.Hidden && col.DefaultExpr != nil { if !(*col.DefaultExpr == "unique_rowid()" || tree.IsConst(c.EvalCtx, c.defaultExprs[i])) {
Could be bring this out to a helpers that determines if the expr is supported by IMPORT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Anzoteh96 and @miretskiy)
pkg/sql/row/row_converter.go, line 113 at r8 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
One more tiny readability request:
Let's add
useDefaultBuiltin *customFunc
Use this constant instead of
nil
here, and when walking below...
I meant:
var useDefaultBuiltin *customFunc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Anzoteh96, @miretskiy, and @pbardea)
pkg/ccl/importccl/import_stmt_test.go, line 3268 at r7 (raw file):
Previously, pbardea (Paul Bardea) wrote…
Can we add this as a comment in the test?
Done.
pkg/ccl/importccl/import_stmt_test.go, line 4439 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
what's a gid?
It's a column name of serial type (which basically means unique_rowid()
in the pgdump file geo.sql
. But anyways, this comment has been removed since we no longer check the validity of gid (as unique_rowid
) here (only the comparison of the other corresponding columns, like what the test initially had before this change), because this check has been handled in the TestImportDefault
test.
pkg/ccl/importccl/import_stmt_test.go, line 4433 at r8 (raw file):
Previously, pbardea (Paul Bardea) wrote…
nit: "due to that rowid shows up"/"due to rowid showing up"
Done.
pkg/ccl/importccl/read_import_base.go, line 411 at r9 (raw file):
) (*row.DatumRowConverter, error) { conv, err := row.NewDatumRowConverter( ctx, importCtx.tableDesc, importCtx.targetCols, importCtx.evalCtx, importCtx.kvCh)
Note: copy removed from here.
pkg/ccl/importccl/read_import_pgdump.go, line 511 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Yeah -- sorry -- that's exaclty what I meant -- we should make a copy in new datum converter; and cleanup other callers (like read_import_base)
Done. This means:
- Creating a copy in the new datum converter;
- Remove the extra copy from
read_import_base.go
.
pkg/sql/row/row_converter.go, line 42 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
nit: let's not abbreviate? cellInfoAnnotation is just a bit longer.
Done.
pkg/sql/row/row_converter.go, line 45 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
nit: uniqueRowIDInstance?
Done.
pkg/sql/row/row_converter.go, line 85 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Do we need to dereference here? Looks like Annotations has a pointer receive;
I think you should be able to just say evalCtx.Annotations.Get()...?nit: Also, consider adding a small helper for readability (getCellInfoAnnotation):
Done (and yes we don't have to deference, thanks for catching).
pkg/sql/row/row_converter.go, line 124 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
c := annot.Get(cellInfoAddr).(cellInfoAnnot) c.uniqueRowIDTotal++ annot.Set(cellInfoAddr, c)
It would be better if you added a helper function (getCellInfoAnnotation) as I suggest above,
and, instead of storing a struct in the annotation, store the struct pointer (and have the helper return the pointer instead.So, this code could become:
getCellInfoAnnotation(annot).uniqueRowIDTotal++
Done.
pkg/sql/row/row_converter.go, line 185 at r7 (raw file):
Previously, pbardea (Paul Bardea) wrote…
Can we also move this comment inside the if statement?
Done.
pkg/sql/row/row_converter.go, line 194 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I would probably make side effect optional (if sideEffect != nil )....?
Also, do you think it would make sense to rename it to
"visitorSideEffect" to make it explicit that this is accessed during visit only?
Done. (For both)
pkg/sql/row/row_converter.go, line 216 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Lets add
type overrideVolatility bool const overrideImmutable overrideVolatility = false const overrideVolatile overrideVolatility = true
And use those constants instead of true/false in this method.
Looks like if we used the three lines and do v.volatile = overrideVolatile
the compiler will give an error "cannot use overrideVolatile
(type overrideVolatility
) as bool, so I've changed that to
const overrideImmutable = false
const overrideVolatile = true
pkg/sql/row/row_converter.go, line 587 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
note: you won't need to Set again if
getCellInfoAnnotation()
helper returns *cellInfoAnnot.Also, let's add a method to the struct:
func (c *cellInfoAnnot) Reset(source, row) {...} getCellInfoAnnotation(c.EvalCtx).Reset(sourceID, rowIndex)
Done.
pkg/sql/row/row_converter.go, line 113 at r8 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I meant:
var useDefaultBuiltin *customFunc
Done.
pkg/sql/row/row_converter.go, line 163 at r8 (raw file):
Previously, pbardea (Paul Bardea) wrote…
nit: let's also rename this
annotations
Done.
pkg/sql/row/row_converter.go, line 165 at r8 (raw file):
Previously, pbardea (Paul Bardea) wrote…
Can we also add a flag here explaining when we expect this flag to be set?
Done.
pkg/sql/row/row_converter.go, line 469 at r9 (raw file):
tableDesc: immutDesc, KvCh: kvCh, EvalCtx: evalCtx.Copy(),
Note: copy added here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Anzoteh96, @miretskiy, and @pbardea)
pkg/sql/row/row_converter.go, line 531 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
if col.Hidden { if col.DefaultExpr == nil || *col.DefaultExpr != "unique_rowid()" || hidden != -1 { return nil, errors.New("unexpected hidden column") } hidden = i }
Do we still need this branch, now that we know how to evaluation unique row id?
This branch is here even before the support for default expressions started. I think the purpose is to check that an import schema has at most one hidden column, and that the hidden column (if any) must have unique_rowid()
as default expression. I'm not sure what was the original intention of this part of the code -- let me see if I can do git blame for backtracking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Anzoteh96, @dt, @miretskiy, @mjibson, and @pbardea)
pkg/sql/row/row_converter.go, line 531 at r7 (raw file):
Previously, Anzoteh96 (Anzo Teh) wrote…
This branch is here even before the support for default expressions started. I think the purpose is to check that an import schema has at most one hidden column, and that the hidden column (if any) must have
unique_rowid()
as default expression. I'm not sure what was the original intention of this part of the code -- let me see if I can do git blame for backtracking.
So I noticed that the earliest time when this was implemented (later versions were mainly refactoring) was at #22542 when hidden column with primary key was first added for support, so I guess it made sense to only restrict to one hidden column for defensive checking. My take is that this isn't really necessary now: @dt @mjibson what are your thoughts?
TL;DR previously we had the condition check where each table can have at most one hidden column, and that this hidden column must have unique_rowid
as default expression. Is this check still necessary, now that we have unique_rowid
supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Anzoteh96, @dt, @miretskiy, @mjibson, and @pbardea)
pkg/sql/row/row_converter.go, line 216 at r7 (raw file):
Previously, Anzoteh96 (Anzo Teh) wrote…
Looks like if we used the three lines and do
v.volatile = overrideVolatile
the compiler will give an error "cannot useoverrideVolatile
(typeoverrideVolatility
) as bool, so I've changed that toconst overrideImmutable = false const overrideVolatile = true
It sounds like you need to update the type of v.volatile
to be overrideVolatility
instead of bool. We'd also probably need to change the !volatile
check to volatility == overrideImmutable
.
pkg/sql/row/row_converter.go, line 603 at r10 (raw file):
return ok } getCellInfoAnnotation(c.EvalCtx.Annotations).Reset(sourceID, rowIndex)
I quite like where this ended up!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few super minor nits, but it's from me. Perhaps wait for @pbardea and @dt to comment as well.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @Anzoteh96, @dt, @miretskiy, @mjibson, and @pbardea)
pkg/sql/row/row_converter.go, line 216 at r7 (raw file):
Previously, pbardea (Paul Bardea) wrote…
It sounds like you need to update the type of
v.volatile
to beoverrideVolatility
instead of bool. We'd also probably need to change the!volatile
check tovolatility == overrideImmutable
.
Yes, that would be nice. "if volatility == overrideImmutable" is more readable than if !volatile
pkg/sql/row/row_converter.go, line 531 at r7 (raw file):
Previously, Anzoteh96 (Anzo Teh) wrote…
So I noticed that the earliest time when this was implemented (later versions were mainly refactoring) was at #22542 when hidden column with primary key was first added for support, so I guess it made sense to only restrict to one hidden column for defensive checking. My take is that this isn't really necessary now: @dt @mjibson what are your thoughts?
TL;DR previously we had the condition check where each table can have at most one hidden column, and that this hidden column must have
unique_rowid
as default expression. Is this check still necessary, now that we haveunique_rowid
supported?
I'm pretty sure we can remove this branch; but maybe @dt has a different take.
pkg/sql/row/row_converter.go, line 209 at r10 (raw file):
// relevant counter (e.g. the total number of occurrences of the // unique_rowid function in an expression). v.volatile = overrideVolatile
Perhaps rename volatile field to volatility
?
pkg/sql/row/row_converter.go, line 566 at r10 (raw file):
c.defaultCache[i] = typedExpr // This default expression isn't volatile, so we can evaluate once here // and memoize it.
nit: move the comment inside if statement (and wrap it if needed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @Anzoteh96, @dt, @miretskiy, @mjibson, and @pbardea)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM mod small remaining threads
Reviewed 1 of 1 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @Anzoteh96, @dt, @miretskiy, @mjibson, and @pbardea)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @Anzoteh96, @dt, @miretskiy, @mjibson, and @pbardea)
pkg/sql/row/row_converter.go, line 216 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Yes, that would be nice. "if volatility == overrideImmutable" is more readable than
if !volatile
Done. (Also, moved those to the list of constants declared in the file).
pkg/sql/row/row_converter.go, line 209 at r10 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Perhaps rename volatile field to
volatility
?
Done.
pkg/sql/row/row_converter.go, line 566 at r10 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
nit: move the comment inside if statement (and wrap it if needed)
Done.
pkg/sql/row/row_converter.go, line 48 at r12 (raw file):
overrideErrorTerm overrideVolatility = false overrideImmutable overrideVolatility = false overrideVolatile overrideVolatility = true
Note: the overrideVolatility
constants moved here (see my next comments below for the use of overrideErrorTerm
).
pkg/sql/row/row_converter.go, line 256 at r12 (raw file):
return nil, overrideErrorTerm, unsafeExpressionError(err, "type checking error", expr.String()) }
This SanitizeExprsForImport
function has overrideVolatility
as one of its returns, so something needs to be returned too even though this SanitizeExprsForImport
somehow produces an error. For this reason, the term overrideErrorTerm
is used, though it functions like a no-op (just like how the nil
typedExpr functions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @Anzoteh96, @dt, @miretskiy, @mjibson, and @pbardea)
pkg/sql/row/row_converter.go, line 238 at r13 (raw file):
// SanitizeExprsForImport checks whether default expressions are supported // for import. func SanitizeExprsForImport(
Last nit: I don't think this needs to be exported - this should also make CI happy.
…INTO The PR cockroachdb#50295 supports non-targeted columns with constant expression. This PR is a follow up to that in adding support to `unique_rowid()`. Previously, the only support given to rowid as a default expression is for hidden column, which is a function of timestamp, row number, and source ID (the ID of processor). To accommodate for more usage of unique_rowid(), this PR modifies the unique_rowid function by making unique_rowid as a function of timestamp, row number, source ID, the total occurrences of unique_rowid in the table schema, and instances of each unique_rowid within each row. In addition, this PR also modifies the visitor method cockroachdb#51390 by adding override methods for volatile methods like unique_rowid. Annotations containing the total occurrences of unique_rowid and unique_rowid instances within a row are stored inside evalCtx, which will be read and updated when visitor walks through the default expression at the sanitization stage, and when default expression is evaluated at each row. Partially addresses cockroachdb#48253 Release note (general change): IMPORT INTO now supports `unique_rowid()` as a default expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @Anzoteh96, @dt, @miretskiy, @mjibson, and @pbardea)
pkg/sql/row/row_converter.go, line 531 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I'm pretty sure we can remove this branch; but maybe @dt has a different take.
I thought about it, and removed this branch and it passed the tests when I tested locally. Let's see what will happen when CI runs it.
pkg/sql/row/row_converter.go, line 238 at r13 (raw file):
Previously, pbardea (Paul Bardea) wrote…
Last nit: I don't think this needs to be exported - this should also make CI happy.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also revised the PR description to talk about overrides.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @Anzoteh96, @dt, @miretskiy, @mjibson, and @pbardea)
Decided that the check for only 1 hidden column in bors r+ |
Build succeeded: |
The PR #50295 supports non-targeted columns with constant expression. This PR is a follow up to that in adding support to
unique_rowid()
.Previously, the only support given to
rowid
as a default expression is for hidden column, which is a function of timestamp, row number, and source ID (the ID of processor). To accommodate for more usage ofunique_rowid()
, this PR modifies theunique_rowid
function by makingunique_rowid
as a function of:unique_rowid
in the table schema;unique_rowid
within each row.In addition, this PR also modifies the visitor method #51390 by adding override methods for volatile methods like
unique_rowid
. Annotations containing the total occurrences ofunique_rowid
andunique_rowid
instances within a row are stored insideevalCtx
, which will be read and updated when visitor walks through the default expression at the sanitization stage, and when default expression is evaluated at each row.Partially addresses #48253
Release note (general change): IMPORT INTO now supports
unique_rowid()
as a default expression.