sql: support COPY protocol #8756
Conversation
(This will need a rebase.) Here you are. Overall, this looks nice.
Reviewed 15 of 15 files at r1.
sql/copy.go, line 15 [r1] (raw file):
Perhaps add an e-mail address here for consistency with other source files.
sql/copy.go, line 28 [r1] (raw file):
For consistency with other planNodes you could add a comment here to explain what this node does. In particular, a part of this comment must explain that this node does not use the expandPlan/Start/Next protocol, and explain instead what it does use.
sql/copy.go, line 84 [r1] (raw file):
Needs a check earlier on that another COPY FROM is not already in progress. (Also, I am 99% certain you want to assign this pointer in the Start() method, not here. Otherwise a Prepare of a COPY will probably not do what you want.)
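For illustration, a minimal sketch of the kind of guard being asked for. The `Session` and `copyNode` shapes and the `beginCopy` helper are assumptions for this sketch, not the PR's actual code:

```go
import "errors"

// Hypothetical shapes standing in for the PR's actual types.
type copyNode struct{}

type Session struct {
	copyFrom *copyNode // non-nil while a COPY FROM is active
}

// beginCopy rejects a second COPY FROM while one is in progress.
func beginCopy(s *Session, n *copyNode) error {
	if s.copyFrom != nil {
		return errors.New("COPY FROM already in progress")
	}
	// Per the comment above, this assignment belongs in Start(), not at
	// plan construction time, so that a Prepare of a COPY behaves correctly.
	s.copyFrom = n
	return nil
}
```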
sql/copy.go, line 116 [r1] (raw file):
comment should be
sql/copy.go, line 138 [r1] (raw file):
Does this protocol specify what happens if the last line before EOF is not NL-terminated? Right now it looks like this code would chomp the last useful char from the data. If the spec says anything, please document it here. If not, this needs to be tested with pg and then a comment added with your test results.
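A tiny sketch of NL-aware chomping that would avoid eating the last byte of an unterminated final line; purely illustrative, with an assumed helper name:

```go
// chompNL strips one trailing newline if present, so a final line that
// is not NL-terminated before EOF keeps its last byte.
func chompNL(line []byte) []byte {
	if n := len(line); n > 0 && line[n-1] == '\n' {
		return line[:n-1]
	}
	return line
}
```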
sql/copy.go, line 143 [r1] (raw file):
Perhaps you could initialize fieldDelim as
sql/copy.go, line 149 [r1] (raw file):
Extract this decode logic to its own function.
sql/copy.go, line 211 [r1] (raw file):
I am not happy about the code duplication between this loop and Scanner.scanString() in parser/scan.go. Is it possible to extract and factor just the escape handling code from both functions?
sql/copy.go, line 224 [r1] (raw file):
How so "1 or 2"? Can you give an example? Where does this spec come from? This looks like an ambiguous syntax definition.
sql/copy.go, line 230 [r1] (raw file):
Don't use a map for this. hex/octal conversion should be done arithmetically; CPU cycles are cheaper than memory. For example:
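The reviewer's snippet did not survive on this page; the following is a hedged reconstruction of the arithmetic approach the comment describes, not the original example. The function names are illustrative:

```go
// hexDigitVal converts one hex digit arithmetically, with no table lookup.
func hexDigitVal(c byte) (byte, bool) {
	switch {
	case c >= '0' && c <= '9':
		return c - '0', true
	case c >= 'a' && c <= 'f':
		return c - 'a' + 10, true
	case c >= 'A' && c <= 'F':
		return c - 'A' + 10, true
	}
	return 0, false
}

// octDigitVal does the same for an octal digit.
func octDigitVal(c byte) (byte, bool) {
	if c >= '0' && c <= '7' {
		return c - '0', true
	}
	return 0, false
}
```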
sql/copy.go, line 244 [r1] (raw file):
Same as above.
sql/copy.go, line 269 [r1] (raw file):
ditto re. duplication with scan.go.
sql/copy.go, line 316 [r1] (raw file):
Move this earlier at the start of the file and document it.
sql/copy_in_test.go, line 15 [r1] (raw file):
ditto re. email address.
sql/copy_test.go, line 15 [r1] (raw file):
ditto
sql/pgwire/v3.go, line 309 [r1] (raw file):
I don't get it; why are they ignored here? I think I understand that these messages can only be received after a COPY statement was executed (which causes sendResponse to start processing copyIn()), but then that means that if this code is reached here there was a protocol error. I'd expect that error to be reported.
sql/pgwire/v3.go, line 842 [r1] (raw file):
I'd suggest creating a constant
sql/pgwire/v3.go, line 874 [r1] (raw file):
Are we sure we want to ignore this? I'd think that at least Flush indicates to process what's already in the buffer, regardless of what comes afterwards.

Comments from Reviewable
Done.

Review status: 8 of 15 files reviewed at latest revision, 19 unresolved discussions, some commit checks pending.

sql/copy.go, line 15 [r1] (raw file):
modulo extracting/documenting

Reviewed 7 of 7 files at r2.

sql/copy.go, line 116 [r1] (raw file):
Implement COPY by maintaining data and row buffers in the SQL Session. When
the row buffer is large enough it is executed as an insertNode.

The COPY protocol is difficult to benchmark with the current state of
lib/pq, which only supports COPY within transactions. We would like to
benchmark non-trivial (100k+ rows) datasets, but doing a single transaction
with 100k rows performs poorly in cockroach. I thus performed some ad-hoc
benchmarking using a single node, with comparisons also to Postgres.

I generated a random dataset of 300k rows in Postgres. Then I ran `pg_dump`
and `pg_dump --inserts` to fetch backups of that table in COPY and INSERT
modes. I inserted that data into cockroach and then used `cockroach dump`
to again extract it. This is because `pg_dump --inserts` writes INSERT
statements with one VALUE row per INSERT, which is inefficient for
cockroach. `cockroach dump` groups them by 100 rows per INSERT, which is
also the rate at which COPY rows are grouped.

The COPY pg_dump file and the cockroach dump file were each timed as they
were inserted into an empty cockroach node. Both ran in about 25s: there
was no significant performance difference between COPY and INSERT. The same
file in Postgres took 2s to COPY and 8s with the cockroach dump file.

The conclusion here is that cockroach write speed is far and away the
bottleneck, and speeding up network and parse operations is not going to
produce any noticeable speedup.

This change is still useful, however, because this is a common protocol for
postgres backups.

Our CLI tool does not support COPY syntax yet. lib/pq would need a large
refactor and enhancement to support non-transactional COPY, as it is
cleverly implemented using the Go database/sql statement API. Adding this
support is a TODO.

Fixes #8585
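To make the buffering scheme concrete, here is a minimal sketch of the buffer-and-flush pattern the commit message describes. All names (`copyBuffer`, `flushThreshold`, `addRow`, `flush`) are illustrative assumptions, not CockroachDB's actual identifiers:

```go
const flushThreshold = 100 // rows per batch, matching the 100-row grouping noted above

// copyBuffer accumulates decoded COPY rows until there are enough to
// hand off as one batched insert.
type copyBuffer struct {
	rows [][]string
}

// addRow appends one decoded row and flushes once the buffer is full.
func (c *copyBuffer) addRow(row []string, insert func([][]string) error) error {
	c.rows = append(c.rows, row)
	if len(c.rows) >= flushThreshold {
		return c.flush(insert)
	}
	return nil
}

// flush hands the buffered rows to the insert callback (standing in for
// the insertNode execution in the real change) and resets the buffer.
func (c *copyBuffer) flush(insert func([][]string) error) error {
	if len(c.rows) == 0 {
		return nil
	}
	err := insert(c.rows)
	c.rows = c.rows[:0]
	return err
}
```

A caller would invoke `addRow` for each incoming COPY data row and call `flush` once more at end-of-copy so a partial final batch is not lost.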