copy: new copy IO implementation #104150

Open
cucaroach wants to merge 3 commits into master from copyio
Conversation

@cucaroach (Contributor) commented May 31, 2023

copy: new copy IO implementation

Refactor COPY so that all the buffer reading takes place in a separate
implementation of the io.Reader interface. This does two things: it
enables the COPY implementation to efficiently handle small CopyData
frames by eliminating extra buffering, and it exposes the COPY bytes as
a pure stream of bytes, which makes retry easier. It also cleans up the
COPY code that handles CopyData segments straddling line boundaries: now
we can just let the text/CSV readers do their thing and not have to do
any writeback.
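
For illustration only, a minimal sketch of such a frame-stripping reader is below. The type name, error handling, and framing details are assumptions based on the pgwire COPY sub-protocol ('d' = CopyData, 'c' = CopyDone), not the PR's actual code.

```go
package copyio

import (
	"encoding/binary"
	"fmt"
	"io"
)

// frameReader is a hypothetical io.Reader that consumes pgwire CopyData
// frames from an underlying reader and exposes only their payloads as one
// continuous byte stream. CopyFail and other error paths are simplified.
type frameReader struct {
	src       io.Reader // e.g. the pgwire buffered reader
	remaining int       // payload bytes left in the current CopyData frame
	done      bool      // set once CopyDone has been seen
}

func (r *frameReader) Read(p []byte) (int, error) {
	for r.remaining == 0 {
		if r.done {
			return 0, io.EOF
		}
		var hdr [5]byte // 1-byte message type + 4-byte length (length includes itself)
		if _, err := io.ReadFull(r.src, hdr[:]); err != nil {
			return 0, err
		}
		switch hdr[0] {
		case 'd': // CopyData: note the payload size, then read it below
			r.remaining = int(binary.BigEndian.Uint32(hdr[1:])) - 4
		case 'c': // CopyDone: report EOF on the next iteration
			r.done = true
		default:
			return 0, fmt.Errorf("unexpected message type %q during COPY", hdr[0])
		}
	}
	if len(p) > r.remaining {
		p = p[:r.remaining]
	}
	n, err := r.src.Read(p)
	r.remaining -= n
	return n, err
}
```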

The old implementation would read from a pgwire BufferedReader (copy 1)
into a pgwire "ReadBuffer" (copy 2) and then push those segments into a
bytes.Buffer "buf" (copy 3). The text and binary readers would read
right from buf; the CSV reader has its own buffer, so we would read
lines from buf and write them into the CSV reader's buffer (copy 4).

The new approach does away with all this: the text format reads
directly from a bufio.Reader (copy 1) stacked on the copy.Reader
(no buffering), which is stacked on the pgwire BufferedReader (copy 2).
For CSV, the CSVReader reads directly from the copy.Reader since it has
its own buffer, so again there are only two copies off the wire. Binary
reads directly from the copy.Reader since it requires no ReadLine (but
the I/O is still buffered at the pgwire level).
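
A sketch of how the three formats could be wired on top of that stream, reusing the hypothetical frameReader above and using the standard library's bufio and encoding/csv as stand-ins for the PR's actual readers (buildReaders is an illustrative name):

```go
package copyio

import (
	"bufio"
	"encoding/csv"
	"io"
)

// buildReaders sketches the per-format stacking described above. conn stands
// in for the pgwire connection; the bufio.Reader wrapping it plays the role
// of the pgwire BufferedReader.
func buildReaders(conn io.Reader) (text *bufio.Reader, csvr *csv.Reader, bin io.Reader) {
	pgwireBuf := bufio.NewReader(conn)     // pgwire-level buffer: the one unavoidable copy off the wire
	copyRd := &frameReader{src: pgwireBuf} // no buffering of its own; just strips CopyData headers

	text = bufio.NewReader(copyRd) // text needs ReadLine, so it gets its own buffer (second copy)
	csvr = csv.NewReader(copyRd)   // the CSV reader buffers internally, so it reads the stream directly
	bin = copyRd                   // binary has length-prefixed fields and needs no line buffering
	return text, csvr, bin
}
```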

This doesn't seem to affect performance much, but it gives the GC a nice
break and sets up a clean solution for #99327.

When encountering a memory usage error, we used to try to let the
encoder finish the row, but with the more efficient buffering this
started succeeding where it had always failed before. Now we skip that
hail mary: if we hit the limit, we bail and return immediately, which is
simpler and more OOM-safe.
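
A minimal sketch of that simplified policy, with a hypothetical memoryAccount interface standing in for the real memory accounting (appendDatum is an illustrative name, not the PR's code):

```go
package copyio

import "context"

// memoryAccount is a placeholder for CockroachDB's memory accounting; only
// the behavior relevant here is modeled.
type memoryAccount interface {
	Grow(ctx context.Context, n int64) error
}

// appendDatum reserves memory before buffering an encoded datum and returns
// immediately if the budget is exceeded, rather than trying to finish the row.
func appendDatum(ctx context.Context, acc memoryAccount, buf, encoded []byte) ([]byte, error) {
	if err := acc.Grow(ctx, int64(len(encoded))); err != nil {
		return buf, err // bail out right away; no attempt to finish the row
	}
	return append(buf, encoded...), nil
}
```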

```
$ BENCHES=BenchmarkCopyCSVEndToEnd PKG=./pkg/sql/copy scripts/bench master
Executed 1 out of 1 test: 1 test passes.
name                old time/op    new time/op    delta
CopyCSVEndToEnd-10     3.94s ± 4%     3.76s ± 4%   -4.59%  (p=0.000 n=10+9)

name                old alloc/op   new alloc/op   delta
CopyCSVEndToEnd-10    8.92GB ± 1%    8.61GB ± 2%   -3.46%  (p=0.000 n=10+9)

name                old allocs/op  new allocs/op  delta
CopyCSVEndToEnd-10     13.8M ± 0%     11.6M ± 0%  -16.08%  (p=0.000 n=7+8)
```

Fixes: #93156
Informs: #99327
Release note: none
Epic: CRDB-25321

copy: enhance copyfrom roachtest

Add a 3-node config for more retry coverage and turn off backups since
it's a perf test.

Release note: none
Epic: CRDB-25321
Informs: #99327

copy: add copy carriage return test

Purely a test change to get 100% line coverage in readTextData (a rough
sketch of the kind of case it covers follows below).

Informs: #99327
Epic: None
Release note: None
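
For illustration only, a test in this spirit might look like the snippet below; it checks that a text-format line terminated with "\r\n" comes back without the carriage return, and exercises a plain bufio line read rather than the actual readTextData:

```go
package copyio

import (
	"bufio"
	"strings"
	"testing"
)

// TestTextCopyCarriageReturn is a hypothetical stand-in for the test added in
// this commit: a COPY text line ending in "\r\n" should yield the row content
// without the trailing carriage return.
func TestTextCopyCarriageReturn(t *testing.T) {
	in := bufio.NewReader(strings.NewReader("1\tfoo\r\n"))
	line, err := in.ReadString('\n')
	if err != nil {
		t.Fatal(err)
	}
	if got := strings.TrimRight(line, "\r\n"); got != "1\tfoo" {
		t.Fatalf("unexpected line: %q", got)
	}
}
```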

@cockroach-teamcity (Member)

This change is Reviewable

@cucaroach force-pushed the copyio branch 3 times, most recently from 6cf3ed2 to 4af728a on June 2, 2023 12:52
@cucaroach force-pushed the copyio branch 2 times, most recently from f3433ce to 8e178b8 on June 28, 2023 14:36
@cucaroach requested review from a team, mgartner and rafiss on June 28, 2023 14:38
@cucaroach force-pushed the copyio branch 2 times, most recently from 6030158 to bf3ac9f on July 17, 2023 15:36
@cucaroach marked this pull request as ready for review on July 18, 2023 16:37
@cucaroach requested review from a team as code owners on July 18, 2023 16:37
@cucaroach requested review from herkolategan and srosenberg and removed the request for a team on July 18, 2023 16:37
@cucaroach (Contributor, Author) commented

This keeps hitting flakes in CI, but I believe it's ready for review. I added a new test to bring code coverage up to 100% for the code I took a hatchet to; hopefully this will allay some defect fears. I also added the results for @rafiss's new copy benchmark, which show some nice gains.

@mgartner removed their request for review on November 21, 2023 14:37
@rafiss removed the review request for rafiss and a team on December 7, 2023 01:51

Successfully merging this pull request may close these issues.

copy: optimize I/O for small data segments