
importccl: add support for importing computed columns #51321

Merged
1 commit merged on Aug 19, 2020

Conversation

Anzoteh96

@Anzoteh96 Anzoteh96 commented Jul 10, 2020

Previously, computed columns were not supported by IMPORT. This PR addresses that by adding support for computed columns in IMPORT, covering the following scope:

  1. For IMPORT INTO, both CSV and AVRO are supported.
  2. For IMPORT TABLE, only AVRO is supported, given that typical CSV data does not have its column mapping specified (that is, does the first column correspond to "a" or "b" in the table?).
  3. For IMPORT from a dump file, only PGDUMP is supported at this point, as MYSQLDUMP does not appear to support SQL statements containing AS STORED.

The main gist of the code is to process the computed columns at the row converter stage, and then evaluate those computed expressions for each row once the other, non-computed columns are evaluated.

Release note (general change): computed columns are now supported in IMPORT for a subset of the file formats.
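The two-phase approach described above (fill non-computed datums first, then evaluate computed expressions against them) can be sketched as follows. This is a toy illustration in Go, not the PR's actual row-converter code; `convertRow` and `computedExpr` are invented names, and the closure stands in for real SQL expression evaluation:

```go
package main

import "fmt"

// computedExpr is a toy stand-in for a computed-column expression: it
// reads already-populated datums and produces the computed value.
type computedExpr func(datums map[string]int) int

// convertRow mimics the two-phase conversion: first copy the imported
// (non-computed) values, then evaluate each computed column against the
// completed set of non-computed datums.
func convertRow(input map[string]int, computed map[string]computedExpr) map[string]int {
	datums := make(map[string]int, len(input)+len(computed))
	for col, v := range input {
		datums[col] = v // phase 1: non-computed columns
	}
	for col, expr := range computed {
		datums[col] = expr(datums) // phase 2: computed columns
	}
	return datums
}

func main() {
	// Mirrors: CREATE TABLE t (a INT, b INT, c INT AS (a + b) STORED)
	computed := map[string]computedExpr{
		"c": func(d map[string]int) int { return d["a"] + d["b"] },
	}
	row := convertRow(map[string]int{"a": 1, "b": 2}, computed)
	fmt.Println(row["c"]) // 3
}
```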

@cockroach-teamcity
Member

This change is Reviewable

@bobvawter
Contributor

Would this allow IMPORT MYSQLDUMP to convert MySQL's VIRTUAL column type into our STORED?

@Anzoteh96
Author

Would this allow IMPORT MYSQLDUMP to convert MySQL's VIRTUAL column type into our STORED?

I'm not really sure how VIRTUAL works in MYSQLDUMP. If it behaves the same as STORED in CRDB, then potentially yes.

But that might also require some fixes in getting the targeted columns for the MYSQLDUMP case (which I will look into over the next few days). PGDUMP, for example, required the fix in #51065 (and I suspect the same might be true for MySQL).

@Anzoteh96
Author

@bobvawter I just had a chance to revisit this PR: it seems that VIRTUAL columns are not supported by CRDB at all (also mentioned in issue #24455). According to this doc, too, computed columns must be STORED.

@bobvawter
Contributor

The main use case that I'm seeing is (enterprise) customers who want to test-drive CRDB with their existing schemas. I think that it's reasonable to convert VIRTUAL to STORED in the interest of reducing friction when folks are kicking the tires.

@Anzoteh96 Anzoteh96 force-pushed the import-computed branch 4 times, most recently from 66ee2df to 71f87f9 Compare July 29, 2020 15:22
@Anzoteh96 Anzoteh96 marked this pull request as ready for review July 29, 2020 15:22
@Anzoteh96 Anzoteh96 requested review from a team and pbardea and removed request for a team July 29, 2020 15:22
Contributor

@miretskiy miretskiy left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Anzoteh96 and @pbardea)


pkg/ccl/importccl/import_stmt_test.go, line 3314 at r1 (raw file):

Quoted 30 lines of code…
				},
				{
					"name": "b",
					"type": "int",
				},
			},
		}
		schemaStr, err := json.Marshal(schema)
		require.NoError(t, err)
		codec, err := goavro.NewCodec(string(schemaStr))
		require.NoError(t, err)
		// Create an AVRO writer from the schema.
		ocf, err := goavro.NewOCFWriter(goavro.OCFConfig{
			W:     &data,
			Codec: codec,
		})
		require.NoError(t, err)
		row1 := map[string]interface{}{
			"a": 1,
			"b": 2,
		}
		row2 := map[string]interface{}{
			"a": 3,
			"b": 4,
		}
		// Add the data rows to the writer.
		require.NoError(t, ocf.Append([]interface{}{row1, row2}))
		// Retrieve the AVRO encoded data.
		avroData = data.String()
	}

Have you seen the avroGen interface in read_import_avro_test?


pkg/sql/sqlbase/default_exprs.go, line 114 at r1 (raw file):

// ProcessDefaultComputedColumns adds columns with DEFAULT or COMPUTED
// to cols if not present, and returns those expressions.

I think it's not a great idea to add syntactic-sugar utility functions like this (and the one above).
The code is much cleaner if you simply call the underlying MakeDefaultExpr with the appropriate column filter yourself, at the call site. Alternatively, you can just define these functions in the file where you use them.

@Anzoteh96 Anzoteh96 changed the title importccl: [WIP] add support for importing computed columns importccl: add support for importing computed columns Jul 29, 2020
Author

@Anzoteh96 Anzoteh96 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)


pkg/ccl/importccl/import_stmt_test.go, line 3314 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Have you seen the avroGen interface in read_import_avro_test?

I just looked into the avroGen interface: I could use the testHelper interface to get the schema set up, but there is still more work to be done here to get the actual avroData that includes the row data.


pkg/sql/row/row_converter.go, line 417 at r2 (raw file):

	cols []sqlbase.ColumnDescriptor, tableDesc *sqlbase.ImmutableTableDescriptor,
) []sqlbase.ColumnDescriptor {
	colIDSet := make(map[sqlbase.ColumnID]struct{}, len(cols))

Note: this is where getRelevantColumns is called, which is basically sqlbase.processColumnSet except for the special case of DEFAULT + computed columns.


pkg/sql/sqlbase/default_exprs.go, line 114 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I think it's not a great idea to add syntactic-sugar utility functions like this (and the one above).
The code is much cleaner if you simply call the underlying MakeDefaultExpr with the appropriate column filter yourself, at the call site. Alternatively, you can just define these functions in the file where you use them.

Unfortunately processColumnSet is not exported by sqlbase. My workaround here is to rewrite a processColumnSet-ish function in row_converter.go (called getRelevantColumns) and then call MakeDefaultExpr from there.
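The shape of such a processColumnSet-ish helper can be sketched roughly as below. `ColumnDescriptor` here is a pared-down stand-in for sqlbase.ColumnDescriptor, and the field and function names are illustrative, not the PR's exact code:

```go
package main

import "fmt"

// ColumnDescriptor is a pared-down stand-in for sqlbase.ColumnDescriptor,
// keeping only the fields this sketch needs.
type ColumnDescriptor struct {
	ID         int
	Name       string
	HasDefault bool
	IsComputed bool
}

// getRelevantColumns starts from the target columns and appends every
// table column that carries a DEFAULT or computed expression and is not
// already targeted, deduplicating via a column-ID set.
func getRelevantColumns(cols, tableCols []ColumnDescriptor) []ColumnDescriptor {
	colIDSet := make(map[int]struct{}, len(cols))
	for _, c := range cols {
		colIDSet[c.ID] = struct{}{}
	}
	out := append([]ColumnDescriptor(nil), cols...)
	for _, c := range tableCols {
		if _, ok := colIDSet[c.ID]; ok {
			continue // already targeted
		}
		if c.HasDefault || c.IsComputed {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	table := []ColumnDescriptor{
		{ID: 1, Name: "a"},
		{ID: 2, Name: "b", HasDefault: true},
		{ID: 3, Name: "c", IsComputed: true},
	}
	targets := []ColumnDescriptor{{ID: 1, Name: "a"}}
	for _, c := range getRelevantColumns(targets, table) {
		fmt.Println(c.Name)
	}
}
```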

Contributor

@miretskiy miretskiy left a comment


Reviewed 1 of 6 files at r1, 3 of 5 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Anzoteh96 and @pbardea)


pkg/ccl/importccl/import_stmt_test.go, line 3314 at r1 (raw file):

Previously, Anzoteh96 (Anzo Teh) wrote…

I just looked into the avroGen interface: I could incorporate using the testHelper interface to get the schema up, but there are still more work to be done here in order to get the actual avroData that includes the row data.

Just bringing this up to your attention -- if you can use it, great; if not, oh well.


pkg/sql/sqlbase/default_exprs.go, line 114 at r1 (raw file):

Previously, Anzoteh96 (Anzo Teh) wrote…

Unfortunately processColumnSet is not exported by sqlbase. My getaround here would be to rewrite a processColumnSet-ish function in row_converter.go (known as getRelevantColumns) and then call MakeDefaultExpr from there.

Is it possible to make it exported? Or perhaps just add an exported version of that method, with a TODO to refactor callers of the unexported one?

Contributor

@pbardea pbardea left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Anzoteh96 and @pbardea)


pkg/ccl/importccl/import_stmt_test.go, line 3357 at r2 (raw file):

CREATE TABLE users (a INT, b INT, c INT AS (a + b) STORED);
INSERT INTO users (a, b) VALUES (1, 2), (3, 4);
`

Do we ever actually expect PGDUMP data to look like this? I would imagine any data exported from PGDUMP would have the value already populated in the computed column.


pkg/sql/row/row_converter.go, line 658 at r2 (raw file):

		&txCtx,
		c.EvalCtx,
		&semaCtx, true)

Can you add an /* addingCols */ comment on this param?


pkg/sql/row/row_converter.go, line 662 at r2 (raw file):

		return errors.Wrapf(err, "error evaluating computed expression for IMPORT INTO")
	}
	computedCols := make([]sqlbase.ColumnDescriptor, 0)

Is there a reason we need to change this? It seems like we'll be allocating an extra slice in cases where we don't have computed exprs.

Author

@Anzoteh96 Anzoteh96 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)


pkg/ccl/importccl/import_stmt_test.go, line 3314 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Just bringing this up to your attention -- if you can use it, great; if not, oh well.

I'm using a different method now: a helper function, createAvroData, creates the data for me.
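The schema-construction step that the quoted test code performs (before handing the schema to goavro.NewCodec) can be sketched with the standard library alone. `buildAvroSchema` is an invented name for this sketch, not the PR's createAvroData helper, and the goavro/OCF steps are deliberately omitted to keep it stdlib-only:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildAvroSchema builds an Avro record schema as Go maps and marshals
// it to JSON, mirroring the first step of the quoted test code. The
// resulting string is what would be passed to an Avro codec.
func buildAvroSchema(name string, colNames, colTypes []string) (string, error) {
	fields := make([]map[string]string, len(colNames))
	for i := range colNames {
		fields[i] = map[string]string{"name": colNames[i], "type": colTypes[i]}
	}
	schema := map[string]interface{}{
		"type":   "record",
		"name":   name,
		"fields": fields,
	}
	b, err := json.Marshal(schema)
	return string(b), err
}

func main() {
	s, err := buildAvroSchema("users", []string{"a", "b"}, []string{"int", "int"})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```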


pkg/ccl/importccl/import_stmt_test.go, line 3357 at r2 (raw file):

Previously, pbardea (Paul Bardea) wrote…

Do we ever actually expect PGDUMP data to look like this? I would imagine any data exported from PGDUMP would have the value already populated in the computed column.

Hmm I'm not sure...let me look into it.


pkg/sql/row/row_converter.go, line 658 at r2 (raw file):

Previously, pbardea (Paul Bardea) wrote…

Can you add an /* addingCols */ comment on this param?

Done (interesting how Reviewable swallows this change but it's reflected here https://github.com/cockroachdb/cockroach/pull/51321/files )


pkg/sql/row/row_converter.go, line 662 at r2 (raw file):

Previously, pbardea (Paul Bardea) wrote…

Is there a reason we need to change this? It seems like we'll be allocating an extra slice in cases where we don't have computed exprs.

Done.


pkg/sql/sqlbase/default_exprs.go, line 114 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Is it possible to make it exported? Or perhaps just add an exported version of that method, with a TODO to refactor callers of the unexported one?

I'm making it exported for now.

@pbardea pbardea self-requested a review August 4, 2020 19:05
Contributor

@pbardea pbardea left a comment


Reviewed 1 of 4 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Anzoteh96, @miretskiy, and @pbardea)


pkg/ccl/importccl/import_stmt_test.go, line 3413 at r3 (raw file):

			format:          "PGDUMP",
			expectedResults: [][]string{{"1", "2", "3"}, {"3", "4", "7"}},
		},

Tests with both default and computed columns would also be interesting.
Reordering the target cols is also a good test case.


pkg/ccl/importccl/read_import_csv.go, line 140 at r3 (raw file):

			if col.IsComputed() {
				return nil,
					errors.Newf("%q is computed, which is not supported by IMPORT INTO CSV", col.Name)

Should this be a cannot write to computed column error? Are there plans for addressing the TODO here? I think that it's fairly important for that to be addressed before merging.


pkg/sql/row/row_converter.go, line 125 at r3 (raw file):

			// columns, all the columns which could possibly be referenced *are*
			// available.
			if !computedCols[i].IsComputed() {

Why do we need this? Shouldn't we not be marking non-computed columns as computed?


pkg/sql/row/row_converter.go, line 389 at r3 (raw file):

	colsForCompute := make([]sqlbase.ColumnDescriptor, len(c.tableDesc.Columns))
	for _, col := range c.tableDesc.Columns {
		colsForCompute[c.computedIVarContainer.Mapping[col.ID]] = col

How does the ordering of colsForCompute differ from c.tableDesc.Columns? This probably warrants a comment since it isn't immediately clear.
I'm also not sure that colsForCompute is the right name here, since this looks like it's just a different ordering of all of the table columns?


pkg/sql/row/row_converter.go, line 407 at r3 (raw file):

	var computedCols []sqlbase.ColumnDescriptor
	if len(computeExprs) > 0 {
		computedCols = colsForCompute

Is this what we want? I would expect computedCols to just be the columns that are computed.

@Anzoteh96 Anzoteh96 requested review from a team as code owners August 12, 2020 15:03
@Anzoteh96 Anzoteh96 marked this pull request as draft August 12, 2020 15:04
@Anzoteh96 Anzoteh96 removed request for a team August 12, 2020 15:05
@Anzoteh96 Anzoteh96 force-pushed the import-computed branch 2 times, most recently from fd3eab1 to 73f3ffe Compare August 12, 2020 15:08
Author

@Anzoteh96 Anzoteh96 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)


pkg/ccl/importccl/import_stmt_test.go, line 3413 at r3 (raw file):

Previously, pbardea (Paul Bardea) wrote…

Tests with both default and computed columns would also be interesting.
Reordering the target cols is also a good test case.

Done.


pkg/ccl/importccl/read_import_csv.go, line 140 at r3 (raw file):

Previously, pbardea (Paul Bardea) wrote…

Should this be a cannot write to computed column error? Are there plans for addressing the TODO here? I think that it's fairly important for that to be addressed before merging.

It should really be an error checking that IMPORT CSV with target columns unspecified (which happens, in particular, in the IMPORT TABLE case) does not support computed columns. The TODO is there simply because we want to run this check once at the beginning of an IMPORT instead of for every row (though I'm not too sure how that would work).


pkg/sql/row/row_converter.go, line 125 at r3 (raw file):

Previously, pbardea (Paul Bardea) wrote…

Why do we need this? Shouldn't we not be marking non-computed columns as computed?

So how it works is (see my comment below): if the table has no computed columns, then computedCols (or rather, as I renamed it, computedColsLookup) will be nil. Otherwise it will contain all the columns.


pkg/sql/row/row_converter.go, line 389 at r3 (raw file):

Previously, pbardea (Paul Bardea) wrote…

How does the ordering of colsForCompute differ from c.tableDesc.Columns? This probably warrants a comment since it isn't immediately clear.
I'm also not sure that colsForCompute is the right name here, since this looks like it's just a different ordering of all of the table columns?

MakeComputedExprs will return the computedExprs corresponding to the order of colsOrdered (formerly colsForCompute), assuming that at least one of the columns is computed. Therefore we want colsOrdered to follow the order that corresponds to the datums.

As for the second part, would colsOrdered work?
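The reordering being discussed can be illustrated with a small sketch: place each column at the datum index the ivar-container mapping assigns to its column ID, so the slice order matches the datum order. `orderColumnsByMapping` is an invented name, and the mapping stands in for c.computedIVarContainer.Mapping:

```go
package main

import "fmt"

// orderColumnsByMapping reorders columns so that each column sits at the
// datum index its column ID maps to. cols[i] has ID ids[i]; mapping is
// colID -> datum index (a stand-in for the ivar-container mapping).
func orderColumnsByMapping(cols []string, ids []int, mapping map[int]int) []string {
	ordered := make([]string, len(cols))
	for i, name := range cols {
		ordered[mapping[ids[i]]] = name
	}
	return ordered
}

func main() {
	// Table columns in descriptor order, with a mapping that puts "b" first.
	cols := []string{"a", "b", "c"}
	ids := []int{1, 2, 3}
	mapping := map[int]int{1: 2, 2: 0, 3: 1} // colID -> datum index
	fmt.Println(orderColumnsByMapping(cols, ids, mapping)) // [b c a]
}
```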


pkg/sql/row/row_converter.go, line 407 at r3 (raw file):

Previously, pbardea (Paul Bardea) wrote…

Is this what we want? I would expect computedCols to just be the columns that are computed.

I'm thinking of passing colsOrdered into GenerateInsertRow, which would ease the process of finding the corresponding Datum to be filled with the computed expression.

@Anzoteh96 Anzoteh96 marked this pull request as ready for review August 12, 2020 15:25
Author

@Anzoteh96 Anzoteh96 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)


pkg/ccl/importccl/read_import_csv.go, line 140 at r3 (raw file):

Previously, Anzoteh96 (Anzo Teh) wrote…

It should really be an error checking that IMPORT CSV with target columns unspecified (which happens, in particular, in the IMPORT TABLE case) does not support computed columns. The TODO is there simply because we want to run this check once at the beginning of an IMPORT instead of for every row (though I'm not too sure how that would work).

So I think this check can be moved as early as the import planning stage: if the target columns are not specified, then computed expressions shouldn't be supported (with the exception of, perhaps, AVRO, as its target columns are specified in the schema).
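A planning-stage check along these lines might look like the following sketch. `validateComputedTargets` and its parameters are hypothetical names for this illustration, not code from the PR:

```go
package main

import "fmt"

// column is a minimal stand-in for a table column descriptor.
type column struct {
	name     string
	computed bool
}

// validateComputedTargets sketches the planning-time check discussed
// above: if the format cannot specify target columns (e.g. plain CSV in
// IMPORT TABLE), reject tables with computed columns up front rather
// than erroring on every row. AVRO is exempt because its schema carries
// its own column mapping.
func validateComputedTargets(format string, targetsSpecified bool, cols []column) error {
	if targetsSpecified || format == "AVRO" {
		return nil
	}
	for _, c := range cols {
		if c.computed {
			return fmt.Errorf(
				"%q is a computed column, which is not supported by IMPORT %s without target columns",
				c.name, format)
		}
	}
	return nil
}

func main() {
	cols := []column{{name: "a"}, {name: "c", computed: true}}
	fmt.Println(validateComputedTargets("CSV", false, cols) != nil)  // true
	fmt.Println(validateComputedTargets("AVRO", false, cols) == nil) // true
}
```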


pkg/sql/row/row_converter.go, line 389 at r3 (raw file):

Previously, Anzoteh96 (Anzo Teh) wrote…

MakeComputedExprs will return the computedExprs corresponding to the order of colsOrdered (formerly colsForCompute), assuming that at least one of the columns is computed. Therefore we want colsOrdered to follow the order that corresponds to the datums.

As for the second part, would colsOrdered work?

Edit: I've moved that to the row converter stage so that these computed expressions only have to be generated once per import, not once per row.


pkg/sql/row/row_converter.go, line 383 at r5 (raw file):

		return nil, errors.Wrapf(err, "error evaluating computed expression for IMPORT INTO")
	}

The computed expressions are now generated here. Notice that we still want the computed expressions to be in an order that aligns with the datums.

@pbardea pbardea self-requested a review August 18, 2020 15:39
Contributor

@pbardea pbardea left a comment


LGTM

Previously, computed columns were not supported by IMPORT. This PR
addresses that by adding support for computed columns in IMPORT,
covering the following scope:

1. For IMPORT INTO, both CSV and AVRO are supported.

2. For IMPORT TABLE, only AVRO is supported, given that typical CSV data
does not have its column mapping specified (that is, does the first
column correspond to "a" or "b" in the table?).

3. For IMPORT from a dump file, only PGDUMP is supported at this point,
as MYSQLDUMP does not appear to support SQL statements containing AS
STORED.

The main gist of the code is to process the computed columns at the row
converter stage, and then evaluate those computed expressions for each
row once the other, non-computed columns are evaluated.

Release note (general change): computed columns are now supported in the
IMPORT INTO operation.
@Anzoteh96
Author

TFTR!

bors r=pbardea

@craig
Contributor

craig bot commented Aug 19, 2020

Build succeeded:

@craig craig bot merged commit 2dcc96b into cockroachdb:master Aug 19, 2020
@Anzoteh96 Anzoteh96 deleted the import-computed branch August 19, 2020 19:52