copy: new copy IO implementation #104150

Open
cucaroach wants to merge 3 commits into master from copyio
Conversation

@cucaroach (Contributor) commented May 31, 2023

copy: new copy IO implementation

Refactor COPY so that all the buffer reading takes place in a separate
implementation of the io.Reader interface. This does two things: it
enables the COPY implementation to efficiently handle small CopyData
frames by eliminating extra buffering, and it exposes the COPY bytes as
a pure stream of bytes, which makes retry easier. It also cleans up the
COPY code that handles CopyData segments straddling line boundaries: now
we can just let the text/CSV readers do their thing and not have to do
any writeback.
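
For illustration only, a minimal sketch of such a frame-stripping reader is below. The type name, error handling, and framing details are assumptions based on the pgwire COPY sub-protocol ('d' = CopyData, 'c' = CopyDone), not the PR's actual code.

```go
package copyio

import (
	"encoding/binary"
	"fmt"
	"io"
)

// frameReader is a hypothetical io.Reader that consumes pgwire CopyData
// frames from an underlying reader and exposes only their payloads as one
// continuous byte stream. CopyFail and other error paths are simplified.
type frameReader struct {
	src       io.Reader // e.g. the pgwire buffered reader
	remaining int       // payload bytes left in the current CopyData frame
	done      bool      // set once CopyDone has been seen
}

func (r *frameReader) Read(p []byte) (int, error) {
	for r.remaining == 0 {
		if r.done {
			return 0, io.EOF
		}
		var hdr [5]byte // 1-byte message type + 4-byte length (length includes itself)
		if _, err := io.ReadFull(r.src, hdr[:]); err != nil {
			return 0, err
		}
		switch hdr[0] {
		case 'd': // CopyData: note the payload size, then read it below
			r.remaining = int(binary.BigEndian.Uint32(hdr[1:])) - 4
		case 'c': // CopyDone: report EOF on the next iteration
			r.done = true
		default:
			return 0, fmt.Errorf("unexpected message type %q during COPY", hdr[0])
		}
	}
	if len(p) > r.remaining {
		p = p[:r.remaining]
	}
	n, err := r.src.Read(p)
	r.remaining -= n
	return n, err
}
```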

The old implementation would read from a pgwire BufferedReader (copy 1)
into a pgwire "ReadBuffer" (copy 2) and then push those segments into a
bytes.Buffer "buf" (copy 3). The text and binary readers would read
right from buf; the CSV reader has its own buffer, so we would read
lines from buf and write them into the CSV reader's buffer (copy 4).

The new approach does away with all this: the text format reads
directly from a bufio.Reader (copy 1) stacked on the copy.Reader
(no buffering), which is stacked on the pgwire BufferedReader (copy 2).
For CSV, the CSVReader reads directly from the copy.Reader since it has
its own buffer, so again there are only two copies off the wire. Binary
reads directly from the copy.Reader since it requires no ReadLine (but
the I/O is still buffered at the pgwire level).
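
A sketch of how the three formats could be wired on top of that stream, reusing the hypothetical frameReader above and using the standard library's bufio and encoding/csv as stand-ins for the PR's actual readers (buildReaders is an illustrative name):

```go
package copyio

import (
	"bufio"
	"encoding/csv"
	"io"
)

// buildReaders sketches the per-format stacking described above. conn stands
// in for the pgwire connection; the bufio.Reader wrapping it plays the role
// of the pgwire BufferedReader.
func buildReaders(conn io.Reader) (text *bufio.Reader, csvr *csv.Reader, bin io.Reader) {
	pgwireBuf := bufio.NewReader(conn)     // pgwire-level buffer: the one unavoidable copy off the wire
	copyRd := &frameReader{src: pgwireBuf} // no buffering of its own; just strips CopyData headers

	text = bufio.NewReader(copyRd) // text needs ReadLine, so it gets its own buffer (second copy)
	csvr = csv.NewReader(copyRd)   // the CSV reader buffers internally, so it reads the stream directly
	bin = copyRd                   // binary has length-prefixed fields and needs no line buffering
	return text, csvr, bin
}
```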

This doesn't seem to affect performance much, but it gives the GC a nice
break and sets up a clean solution for #99327.

When encountering a memory usage error, we used to try to let the
encoder finish the row, but with the more efficient buffering this
started succeeding where it had always failed before. Now we skip that
hail mary: if we hit the limit, we bail and return immediately, which is
simpler and more OOM-safe.
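
A minimal sketch of that simplified policy, with a hypothetical memoryAccount interface standing in for the real memory accounting (appendDatum is an illustrative name, not the PR's code):

```go
package copyio

import "context"

// memoryAccount is a placeholder for CockroachDB's memory accounting; only
// the behavior relevant here is modeled.
type memoryAccount interface {
	Grow(ctx context.Context, n int64) error
}

// appendDatum reserves memory before buffering an encoded datum and returns
// immediately if the budget is exceeded, rather than trying to finish the row.
func appendDatum(ctx context.Context, acc memoryAccount, buf, encoded []byte) ([]byte, error) {
	if err := acc.Grow(ctx, int64(len(encoded))); err != nil {
		return buf, err // bail out right away; no attempt to finish the row
	}
	return append(buf, encoded...), nil
}
```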

```
$ BENCHES=BenchmarkCopyCSVEndToEnd PKG=./pkg/sql/copy scripts/bench master
Executed 1 out of 1 test: 1 test passes.
name                old time/op    new time/op    delta
CopyCSVEndToEnd-10     3.94s ± 4%     3.76s ± 4%   -4.59%  (p=0.000 n=10+9)

name                old alloc/op   new alloc/op   delta
CopyCSVEndToEnd-10    8.92GB ± 1%    8.61GB ± 2%   -3.46%  (p=0.000 n=10+9)

name                old allocs/op  new allocs/op  delta
CopyCSVEndToEnd-10     13.8M ± 0%     11.6M ± 0%  -16.08%  (p=0.000 n=7+8)
```

Fixes: #93156
Informs: #99327
Release note: none
Epic: CRDB-25321

copy: enhance copyfrom roachtest

Add a 3-node config for more retry coverage and turn off backups since
it's a perf test.

Release note: none
Epic: CRDB-25321
Informs: #99327

copy: add copy carriage return test

Purely a test change to get 100% line coverage in readTextData (a rough
sketch of the kind of case it covers follows below).

Informs: #99327
Epic: None
Release note: None
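
For illustration only, a test in this spirit might look like the snippet below; it checks that a text-format line terminated with "\r\n" comes back without the carriage return, and exercises a plain bufio line read rather than the actual readTextData:

```go
package copyio

import (
	"bufio"
	"strings"
	"testing"
)

// TestTextCopyCarriageReturn is a hypothetical stand-in for the test added in
// this commit: a COPY text line ending in "\r\n" should yield the row content
// without the trailing carriage return.
func TestTextCopyCarriageReturn(t *testing.T) {
	in := bufio.NewReader(strings.NewReader("1\tfoo\r\n"))
	line, err := in.ReadString('\n')
	if err != nil {
		t.Fatal(err)
	}
	if got := strings.TrimRight(line, "\r\n"); got != "1\tfoo" {
		t.Fatalf("unexpected line: %q", got)
	}
}
```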

@cockroach-teamcity (Member)

This change is Reviewable

@cucaroach force-pushed the copyio branch 3 times, most recently from 6cf3ed2 to 4af728a on June 2, 2023 12:52
@cucaroach force-pushed the copyio branch 2 times, most recently from f3433ce to 8e178b8 on June 28, 2023 14:36
@cucaroach requested review from a team, mgartner and rafiss on June 28, 2023 14:38
@cucaroach force-pushed the copyio branch 2 times, most recently from 6030158 to bf3ac9f on July 17, 2023 15:36
@cucaroach marked this pull request as ready for review on July 18, 2023 16:37
@cucaroach requested review from a team as code owners on July 18, 2023 16:37
@cucaroach requested review from herkolategan and srosenberg and removed the request for a team on July 18, 2023 16:37
@cucaroach (Contributor, Author) commented

This keeps hitting flakes in CI, but I believe it's ready for review. I added a new test to bring code coverage up to 100% for the code I took a hatchet to; hopefully this will allay some defect fears. I also added the results for @rafiss's new copy benchmark, which show some nice gains.

@mgartner removed their request for review on November 21, 2023 14:37
@rafiss removed the review request for rafiss and a team on December 7, 2023 01:51

Successfully merging this pull request may close these issues.

copy: optimize I/O for small data segments