Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importccl: cleanup how we ignore stmts in IMPORT PGDUMP #57827

Merged
merged 3 commits into from
Feb 18, 2021

Conversation

adityamaru
Copy link
Contributor

There are two kinds of unsupported statements when processing an IMPORT
PGDUMP:

  • Unparseable
  • Parseable, but unsupported

Previously, we had a set of regex statements to skip over statements
which fell into the first category. Those which fell into the second
category were skipped when we saw an AST Node of that kind.

This change does not ignore any additional statements, but removes the
regex. It does this by leaning on existing yacc grammar rules, or adding
new "unsupported" rules. These unsupported rules raise a special kind of
error which we catch during IMPORT PGDUMP, and is decorated with
required information.

This change also introduces a ignore_unsupported flag to IMPORT PGDUMP
which will allow users to skip all the stmts (of both categories
mentioned above). If not set, we will fail with either a parse or
unsupported stmt error.

Release note (sql change): New IMPORT PGDUMP option ignore_unsupported
to skip over all the unsupported PGDUMP stmts. The collection of these
statements will be appropriately documented.

@adityamaru adityamaru requested review from dt, pbardea, mokaixu and a team December 11, 2020 16:43
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@adityamaru
Copy link
Contributor Author

@dt we discussed that maintaining existing semantics of always ignoring the previously regex-ed statements is less controversial, but while implementing I started liking the idea of biting the bullet once and grouping all ignored statements under the new option. Having this mixed state where some features are ignored by default and others guarded by this option seems a little messy. In my opinion, if the user wants to ignore all unsupported stmts they should supply that option going forward. Open to discussion about this, the change to go either way is simple!

Also open to suggestions on how to convey to the user what we have skipped. I was thinking write to nodelocal of the coordinator node if the user has given us externalio access?

@@ -3894,6 +3921,10 @@ revoke_stmt:
Grantees: $7.nameList(),
}
}
| REVOKE privileges ON SEQUENCE error
{
return unimplementedWithUnsupportedNode(sqllex, "REVOKE ON SEQUENCE", true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if we do want to keep the ignoreInPGDump bool around I think defining a const skipInPGImport = true would be helpful for readability, and pass it in here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the field.

// an import.
type Unsupported struct {
Err error
SkipDuringImportPGDump bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we need this SkipDuringPGDump field? Do we expect there to be a node that returns Unsupported that is not skipped during IMPORT? Might it even make sense to call the node UnsupportedByImport?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the field. I've kept the name as Unsupported since it now backs all unparseable SQL statements. We do only use this fact during an import but I went with a more generic name. Could go either way though.

pkg/ccl/importccl/read_import_pgdump.go Show resolved Hide resolved
pkg/ccl/importccl/import_stmt_test.go Show resolved Hide resolved
case *tree.BeginTransaction, *tree.CommitTransaction:
// ignore txns.
case *tree.SetVar, *tree.Insert, *tree.CopyFrom, copyData, *tree.Delete:
// ignore SETs and DMLs.
case *tree.Analyze:
// ANALYZE is syntatictic sugar for CreateStatistics. It can be ignored because
// the auto stats stuff will pick up the changes and run if needed.
// ANALYZE is syntatictic sugar for CreateStatistics. It can be ignored because
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it looks like the whitespace here was changed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

pkg/ccl/importccl/import_stmt_test.go Show resolved Hide resolved
@@ -99,6 +99,8 @@ const (
avroSchema = "schema"
avroSchemaURI = "schema_uri"

pgDumpIgnoreAllUnsupported = "ignore_unsupported"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as a user I would find it confusing that there are statements which still return some sort of error like "unimplemented" with this flag (please correct me if I'm reading this wrong, but a statement CREATE OPERATOR is not "unsupported"). As you mentioned in the commit message, these would be documented but I still think it might be confusing?

Does it make sense to ignore all unimplemented nodes if we can write them to a rejected file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused, I'm going to try and categorize the types of errors we can see, and then we can decide what should be guarded by this flag.

  1. There are statements like CREATE OPERATOR which return an unimplemented error on parsing. There are several of these in sql.y. This PR replaced some (previously caught by our regex) of those unimplemented errors to instead be backed by an unsupported node, so that when we see such a node on parse.Parse() we can skip over it.

I guess my first question is: do we want to skip all sql statements which are unimplemented in our parser, or do we want to have control and only skip a subset of those. This would mean for the rest, the user would see an unable to parse error.

If the answer to the above question is yes, I would need to wholesale change all the sql.y grammar rules to be backed by our unsupported node. Albeit this could be done by just changing the unimplemented() methods they currently call.

  1. The second group of statements is the ones that have associated tree.Nodes. A statement with a tree.Node is either "supported" by IMPORT, "unsupported and ignored", or just "unsupported" and in the last case, we will throw an error. Here I think I agree with you that if we see something which made it past parsing, but we don't support it then it should not throw an error. This PR doesn't do that yet.

Let me know what you think or if I'm misunderstanding your comment. Going to cc: @mwang1026 in case he has thoughts too.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's confusing UX for a user that said "I want to ignore things that are unsupported" for it to ignore just a subset, especially if we have a "reject" file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my first question is: do we want to skip all sql statements which are unimplemented in our parser, or do we want to have control and only skip a subset of those.

I think that that's also what I was after. I think that moving the ignoring logic from regexes to sql.y is good and can be done on it's own. My concern was that before we were silently ignoring these, so it wasn't surprising to a user that they would run an import and reach some unsupported statement. However, now if they want these statements to be unsupported they have to run the import with the ignore_unsupported, and they might still get "unsupported" errors which is confusing.

I don't think I have a good enough grasp of why some nodes are "unsupported and ignored" or just "unsupported" to have a more concrete opinion on the best way to resolve this though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pbardea just to re-outline the new approach since there has been a lot of discussion:

The ignore_unsupported option will skip the following genres of statements:

  • All statements that have rules in sql.y and return an unimplemented error.
  • All statements that were previously ignored silently egs: BEGIN, COMMIT etc
  • All statements that were previously successfully parsed, but returned an unsupported error during tree.Node evaluation.

The first commit only addresses the first bullet point. Subsequent commits will tackle the next two bullet points.

@pbardea pbardea requested review from pbardea and removed request for mokaixu January 4, 2021 15:05
Copy link
Contributor

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@adityamaru adityamaru force-pushed the unsupported-pgdump branch 2 times, most recently from 5d4fb44 to 20f169e Compare January 31, 2021 16:21
@miretskiy miretskiy requested a review from pbardea February 1, 2021 14:47
Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 10 files at r1, 6 of 8 files at r2, 5 of 7 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @pbardea)


out, line 1 at r3 (raw file):

Running make with -j16

file Accidentally included?


output, line 0 at r3 (raw file):
File accidentally included?


pkg/ccl/importccl/import_processor_test.go, line 957 at r3 (raw file):

		PgDump: roachpb.PgDumpOptions{
			MaxRowSize:        64 * 1024,
			IgnoreUnsupported: true,

Should you update some files in pkg/ccl/importccl/testdata/pgdump/ to add some unsupported cruft?


pkg/ccl/importccl/read_import_pgdump.go, line 108 at r3 (raw file):

			// skipped.
			if errors.HasType(err, (*tree.Unsupported)(nil)) {
				if p.ignoreUnsupportedStmts {

combine these two ifs? I would check for if p.ignoreUnsupported first.


pkg/ccl/importccl/read_import_pgdump.go, line 154 at r3 (raw file):

var (
	ignoreComments = regexp.MustCompile(`^\s*(--.*)`)

i'm curious -- can't our lexer skip comments for us?


pkg/ccl/importccl/read_import_pgdump.go, line 398 at r3 (raw file):

				if cmd.IfNotExists {
					if ignoreUnsupportedStmts {
						// Write to shunt.

is that a TODO?


pkg/ccl/importccl/read_import_pgdump.go, line 426 at r3 (raw file):

Quoted 6 lines of code…
			case *tree.AlterTableValidateConstraint:
				if ignoreUnsupportedStmts {
					// Write to shunt.
					continue
				}
				return errors.Errorf("unsupported statement: %s", stmt)

Should this branch be dropped altogether? Same as default...


pkg/sql/parser/sql.y, line 2718 at r3 (raw file):

  }

alter_unsupported_stmt:

I think you need to add appropriate parse tests (pkg/sql/parse_test.go) for changes in sql.y


pkg/sql/sem/tree/unsupported.go, line 13 at r3 (raw file):

package tree

var _ error = &Unsupported{}

should this be called UnsupportedError?

Copy link
Contributor Author

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @pbardea)


out, line 1 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

file Accidentally included?

ah sorry, deleted.


output, line at r3 (raw file):

ah sorry, deleted.
deleted oops.


pkg/ccl/importccl/import_processor_test.go, line 957 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Should you update some files in pkg/ccl/importccl/testdata/pgdump/ to add some unsupported cruft?

some of the files already have cruft that was silently being ignored before.

egs: simple.sql has
SELECT pg_catalog.set_config('search_path', '', false);
ALTER TABLE public.simple OWNER TO postgres;

db.sql has:
CREATE EXTENSION IF NOT EXISTS plpgsql WITH SCHEMA pg_catalog;
COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language';

I added a few more statements to simple.sql.


pkg/ccl/importccl/read_import_pgdump.go, line 108 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

combine these two ifs? I would check for if p.ignoreUnsupported first.

Done.


pkg/ccl/importccl/read_import_pgdump.go, line 154 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

i'm curious -- can't our lexer skip comments for us?

haven't had a chance to look into this yet, but I will before merging this PR.


pkg/ccl/importccl/read_import_pgdump.go, line 398 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

is that a TODO?

yup, i'll replace these in the next commit.


pkg/ccl/importccl/read_import_pgdump.go, line 426 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
			case *tree.AlterTableValidateConstraint:
				if ignoreUnsupportedStmts {
					// Write to shunt.
					continue
				}
				return errors.Errorf("unsupported statement: %s", stmt)

Should this branch be dropped altogether? Same as default...

good point, deleted!


pkg/sql/parser/sql.y, line 2718 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I think you need to add appropriate parse tests (pkg/sql/parse_test.go) for changes in sql.y

yup, missed that, added!


pkg/sql/sem/tree/unsupported.go, line 13 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

should this be called UnsupportedError?

Done.

Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 8 files at r4, 7 of 10 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @miretskiy, and @pbardea)


pkg/ccl/importccl/import_stmt.go, line 1281 at r5 (raw file):

	// logBuffer holds the string to be flushed to the ignoreUnsupportedLogDest.
	logBuffer *bytes.Buffer

a bit concerned we could blow up server if we try to import pgdump file w/ slew of unsupported statements...

I would be okay just limiting the number of errors we emit; followed by message... "too many errors -- rest skipped"...
Or.. i don't know, maybe just collect very small number of those errored statements (say 5-10) and just send a notice at the end with the
"sample" of errors.

Alternatively, maybe just drop logging to external storage?

@miretskiy miretskiy self-requested a review February 3, 2021 16:30
There are two kinds of unsupported statements when processing an IMPORT
PGDUMP:
- Unparseable
- Parseable, but unsupported

Previously, we had a set of regex statements to skip over statements
which fell into the first category. Those which fell into the second
category were skipped when we saw an AST Node of that kind.

This commit is responsible for the following:

- Removes regex and adds/modifies yacc rules to encompass these regex
  representations.

- Introduces a special UnsupportedNode which backs all yacc rules
  which previously returned an unimplemented error.

- Introduces an IMPORT PGDUMP option to `ignore_unsupported` statements.

- All statements which are not parseable will now be ignored if the
  option is set.

Note: the eventual goal is to have the `ignore_unsupported` flag also
allow users to ignore parseable, but unsupported statements. This will
be done in the next commit.

Release note (sql change): New IMPORT PGDUMP option `ignore_unsupported`
to skip over all the unsupported PGDUMP stmts. The collection of these
statements will be appropriately documented.
Previously, we would silently ignore several statements we might come
across in a PGDUMP file. These statements can be parsed by CRDB but are
not implemented by IMPORT.

This change brings most of these silently ignored stmts under the
umbrella of the newly introduced `ignore_unsupported` option. The
motivation for this change is to further improve the UX of this option.
If specified, we will ignore unsupported stmts, else we will fail the
IMPORT.

This will result in IMPORTs that previously succeeded by silently
ignoring some stmts to fail. The solution is to add the
`ignore_unsupported` option.

Release note (sql change): Users will now need to specify
`ignore_unsupported` to ignore all unsupported import stmts during an
IMPORT PGDUMP.
@adityamaru
Copy link
Contributor Author

This is RFAL! I added a count and a limit to the number of lines we log.

Copy link
Contributor

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits. I'm a fan of the log file approach!

@@ -99,6 +100,12 @@ const (
avroSchema = "schema"
avroSchemaURI = "schema_uri"

pgDumpIgnoreAllUnsupported = "ignore_unsupported"
pgDumpIgnoreShuntFileDest = "ignored_stmt_log"
pgDumpUnsupportedSchemaStmtLog = "unsupported-schema-stmts"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change these separators from - to _ for consistency with the other ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if cfg.numIgnoredStmts < pgDumpMaxLoggedStmts {
numLoggedStmts = cfg.numIgnoredStmts
}
cfg.logBuffer.WriteString(fmt.Sprintf("\nLogging %d out of %d ignored stmts.\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: let's avoid abbreviations in user-facing messages s/stmts/statements

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

err = dest.WriteFile(ctx, pgDumpUnsupportedSchemaStmtLog,
bytes.NewReader(cfg.logBuffer.Bytes()))
if err != nil {
return errors.New("failed to log unsupported during IMPORT PGDUMP")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we wrap the error here, and above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, done.

@miretskiy miretskiy self-requested a review February 17, 2021 16:32
Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 10 files at r8, 2 of 3 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @miretskiy, and @pbardea)


pkg/ccl/importccl/import_stmt.go, line 1302 at r9 (raw file):

	}
	cfg.logBuffer.WriteString(fmt.Sprintf("\nLogging %d out of %d ignored statements.\n",
		numLoggedStmts, cfg.numIgnoredStmts))

do we want to prefix every log statement with this?


pkg/ccl/importccl/import_stmt.go, line 1311 at r9 (raw file):

Quoted 8 lines of code…
	dest, err := p.ExecCfg().DistSQLSrv.ExternalStorageFromURI(ctx,
		cfg.ignoreUnsupportedLogDest, p.User())
	if err != nil {
		return errors.Wrap(err, "failed to log unsupported stmts during IMPORT PGDUMP")
	}
	defer dest.Close()
	err = dest.WriteFile(ctx, pgDumpUnsupportedSchemaStmtLog,
		bytes.NewReader(cfg.logBuffer.Bytes()))

does it make sense to accumulate log messages (like we do), but only flush them at the end?

  type logger struct {}

... 
l := logger
defer l.FlushTo(dest)
l.Log()...

Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @miretskiy, and @pbardea)


pkg/ccl/importccl/read_import_pgdump.go, line 52 at r9 (raw file):

	s              *bufio.Scanner
	copy           *postgreStreamCopy
	unsupportedCfg *unsupportedStmtConfig

Do we want to perhaps introduce


pkg/ccl/importccl/read_import_pgdump.go, line 113 at r9 (raw file):

			// indicated that they want to skip these stmts during the IMPORT, then do
			// so here.
			if p.unsupportedCfg.ignoreUnsupported && errors.HasType(err, (*tree.UnsupportedError)(nil)) {

Discussed on how to better reorg the code -- basically, move this to "unsupported logger" struct and have a nice method (Log) to log.

@miretskiy miretskiy self-requested a review February 18, 2021 00:45
This is the last commit, that adds an `ignored_stmt_log` options to
IMPORT PGDUMP. This option specifies the destination we will log the
statements that we skip over during an import. This option can only be
used in conjunction with `ignore_unsupported`, else the IMPORT will
fail.

Currently, we will write to two files during the import. One during the
schema parsing phase, and another during the data ingestion phase. The
files will be called:

`unsupported-data-stmts`: Contains unparseable stmts, and unsupported
DML stmts.

`unsupported-schema-stmts`: Contains unparseable stmts, and unsupported
DDL stmts.

Release note (sql change): New IMPORT PGDUMP option `ignored_stmt_log`
that allows users to specify where they would like to log stmts that
have been skipped during an import, by virtue of being unsupported.
@adityamaru
Copy link
Contributor Author

@miretskiy this should be RFAL


// unsupportedStmtLogger is responsible for handling unsupported PGDUMP SQL
// statements seen during the import.
type unsupportedStmtLogger struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @miretskiy, and @pbardea)

@adityamaru
Copy link
Contributor Author

TFTRs!

bors r=pbardea,miretskiy

@craig craig bot merged commit 40a35fe into cockroachdb:master Feb 18, 2021
@craig
Copy link
Contributor

craig bot commented Feb 18, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants