Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importccl: fix target column ordering bug for PGDUMP import #51065

Merged
merged 1 commit into from
Jul 9, 2020

Conversation

Anzoteh96
Copy link

@Anzoteh96 Anzoteh96 commented Jul 7, 2020

The current implementation assumes that the target columns of a PGDUMP query is the same as how they are created in the case where target columns are declared in PGDUMP file. This PR addresses it by detecting the target columns in the PGDUMP statement itself if this is the case. In addition, given that the target columns may not be well-determined at the formation of a new DatumRowConverter, so the check of unsupported default column expression is also moved to the DatumRowConverter.Row() function.

Fixed #51159

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

conv.IsTargetCol = make(map[int]struct{})
colLookup := make(map[string]int, 0)
for j, col := range conv.VisibleCols {
colLookup[col.Name] = j
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be created (at most) once per insert, not once per insert tuple. Once per file would be better, but that may be tricky.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this to an outer loop so that it will be created per insert. Don't think it would work for once per file because in the case of IMPORT PGDUMP blah... the targeted columns could be different for each insert statement within the file.

} else {
colName = i.Columns[j].String()
}
ind, ok := colLookup[colName]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unfortunate that now the no column case for inserts is forced to go through a string map lookup. The if block above's purpose is to set colName so that the lookup can be used, but we really only need ind to be set correctly. So this should also be fine?:

if len(i.Columns) == 0 {
  ind = j
} else {
  colName := i.Columns[j].String();
  ind, ok = colLookup[colName];
  if !ok { ... }
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're right, I did the change as you suggested.

@@ -578,11 +584,27 @@ func (m *pgDumpReader) readFile(
if count <= resumePos {
continue
}
if expected, got := len(conv.VisibleCols), len(tuple); expected != got {
if expected, got := expectedColLen, len(tuple); expected != got {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to create expected here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (thanks for catching that!)

@@ -579,19 +583,19 @@ func (m *pgDumpReader) readFile(
}
inserts++
startingCount := count
colLookup := make(map[string]int, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 -> len(conv.VisibleCols)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@Anzoteh96 Anzoteh96 force-pushed the import-pgdump branch 2 times, most recently from 4483400 to ed3370f Compare July 7, 2020 18:18
@Anzoteh96 Anzoteh96 marked this pull request as ready for review July 7, 2020 18:19
@Anzoteh96 Anzoteh96 requested review from a team and pbardea and removed request for a team July 7, 2020 18:19
@Anzoteh96 Anzoteh96 changed the title importccl: [WIP] fix bug for PGDUMP import importccl: fix bug for PGDUMP import Jul 7, 2020
@Anzoteh96 Anzoteh96 changed the title importccl: fix bug for PGDUMP import importccl: fix target column ordering bug for PGDUMP import Jul 7, 2020
}
for i, expr := range tuple {
typed, err := expr.TypeCheck(ctx, &semaCtx, conv.VisibleColTypes[i])
conv.IsTargetCol = make(map[int]struct{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a way to stop doing multiple map creations per INSERT row. This has now regressed and will recreate this IsTargetCol map per INSERT row, whether or not columns are specified, which is worse than what we had before.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the conv.IsTargetCol creation (more like an update) to an outer loop so that it will be done for every INSERT statement instead of each row of the INSERT statement. Unfortunately, this is the best possible optimization in terms of the frequency of the map creation as the target columns (and possibly the order) of each insert statement might be different.

I've also made targetColMapInd slice for each INSERT statement so that each tuple of each row can know which data each tuple corresponds to.

if len(i.Columns) == 0 {
ind = j
} else {
colName := i.Columns[j].String()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so what we actually need is just the type of the column, and to produce an error if a name is used that doesn't exist. So maybe something like:

if len(i.Columns) == 0 {
  typ = conv.VisibleColTypes[j]
} else {
  colName := i.Columns[j].String()
  typ, ok = conv.SomeNameLookup[colName]
  if !ok { ... }
}

And then set SomeNameLookup (renamed clearly) one time during init. It is possible some variable like this already exists, I haven't looked at what conv is. Does that work? If so it would remove creating 2 maps per insert row, which is really bad.

@Anzoteh96 Anzoteh96 force-pushed the import-pgdump branch 3 times, most recently from 8fb6b1d to 65e8ded Compare July 8, 2020 14:24
Copy link
Contributor

@maddyblue maddyblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it's possible to only create a map one time. What we need is a way to map the column index of an insert VALUES row to 1) a type and 2) the correct index in conv.Datums. We already know how to do that when insert.Columns isn't specified by using conv.VisibleColTypes. We need to be able to do that for when insert.Columns is specified. Ok, so when creating the pgDumpReader, for each table in pgDumpReader.tables, create a map that maps column names to a struct with that column's type and its index needed for conv.Datums. This way if insert.Columns is populated, there's an existing lookup map that we can use. Since we're only creating the map one time, we don't care about the tiny perf hit for creating it when it's not used, and it saves us from recreating a map for every row, which is slow since it has to be made on the heap and get GC'd. (If we do have to do this, the map should be part of pgDumpReader and it gets cleared instead of recreated each time, but I don't think we need to do this.)

The current implementation assumes that the target columns of a PGDUMP query
is the same as how they are created in the case where target columns are
declared in PGDUMP file. This PR addresses it by detecting the target columns
in the PGDUMP statement itself if this is the case. In addition, given
that the target columns may not be well-determined at the formation of a
new `DatumRowConverter`, so the check of unsupported default column
expression is also moved to the DatumRowConverter.Row() function.

Release note: None
@Anzoteh96
Copy link
Author

I still think it's possible to only create a map one time. What we need is a way to map the column index of an insert VALUES row to 1) a type and 2) the correct index in conv.Datums. We already know how to do that when insert.Columns isn't specified by using conv.VisibleColTypes. We need to be able to do that for when insert.Columns is specified. Ok, so when creating the pgDumpReader, for each table in pgDumpReader.tables, create a map that maps column names to a struct with that column's type and its index needed for conv.Datums. This way if insert.Columns is populated, there's an existing lookup map that we can use. Since we're only creating the map one time, we don't care about the tiny perf hit for creating it when it's not used, and it saves us from recreating a map for every row, which is slow since it has to be made on the heap and get GC'd. (If we do have to do this, the map should be part of pgDumpReader and it gets cleared instead of recreated each time, but I don't think we need to do this.)

I've updated the change so that the map is created at the newPgDumpReader stage and each insert statement simply reads the map to create the targetColMapInd slice for the use of each tuple statement. Now the conv.IsTargetCol still needs to be updated for each INSERT statement for the following reason: as mentioned, the targeted columns might differ for each statement, and this IsTargetCol is needed at DatumRowConverter.Row function (at least for now) because that's where the default expressions are being evaluated for non-targeted columns (and possibly an error being returned in case default expression is not supported).

Also, added tests to show that it will work in the case of target columns reordering and having supported default columns, both for IMPORT INTO ... PGDUMP FILE and IMPORT PGDUMP.

@maddyblue
Copy link
Contributor

Unfortunate we still have to create a map, but this looks like it's in a good place. Thanks for the work.

@Anzoteh96
Copy link
Author

TFTR!
bors r=mjibson

@craig
Copy link
Contributor

craig bot commented Jul 9, 2020

Build succeeded

@craig craig bot merged commit ba2deac into cockroachdb:master Jul 9, 2020
@Anzoteh96 Anzoteh96 deleted the import-pgdump branch July 10, 2020 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants