copy: new copy IO implementation #104150
Open
cucaroach wants to merge 3 commits into cockroachdb:master from cucaroach:copyio
Conversation
cucaroach force-pushed the copyio branch 3 times, most recently from 6cf3ed2 to 4af728a on June 2, 2023 12:52
cucaroach force-pushed the copyio branch 2 times, most recently from f3433ce to 8e178b8 on June 28, 2023 14:36
cucaroach force-pushed the copyio branch 2 times, most recently from 6030158 to bf3ac9f on July 17, 2023 15:36
Benchmark results attached to the main commit:

```
$ BENCHES=BenchmarkCopyCSVEndToEnd PKG=./pkg/sql/copy scripts/bench master
Executed 1 out of 1 test: 1 test passes.

name                 old time/op    new time/op    delta
CopyCSVEndToEnd-10     3.94s ± 4%     3.76s ± 4%   -4.59%  (p=0.000 n=10+9)

name                 old alloc/op   new alloc/op   delta
CopyCSVEndToEnd-10    8.92GB ± 1%    8.61GB ± 2%   -3.46%  (p=0.000 n=10+9)

name                 old allocs/op  new allocs/op  delta
CopyCSVEndToEnd-10     13.8M ± 0%     11.6M ± 0%  -16.08%  (p=0.000 n=7+8)
```
cucaroach requested review from herkolategan and srosenberg and removed the request for a team on July 18, 2023 16:37
This keeps hitting flakes in CI, but I believe it's ready for review. I added a new test to bring code coverage up to 100% for the code I took a hatchet to; hopefully this will allay some defect fears. I also added the results for @rafiss's new copy benchmark, which shows some nice gains.
copy: new copy IO implementation
Refactor COPY so that all the buffer reading takes place in a separate
implementation of the io.Reader interface. This does two things: it
enables the COPY implementation to efficiently handle small CopyData
frames by eliminating extra buffering, and it exposes the COPY bytes as
a pure stream of bytes, which makes retry easier. It also cleans up the
COPY code that handles CopyData segments straddling line boundaries; now
we can just let the text/CSV readers do their thing without having to do
any writeback.
The old implementation would read from a pgwire BufferedReader (copy 1)
into a pgwire "ReadBuffer" (copy 2) and then push those segments into a
bytes.Buffer "buf" (copy 3). The text and binary readers would read
right from buf; the CSV reader has its own buffer, so we would read
lines from buf and write them into the CSV reader's buffer (copy 4).
The new approach does away with all this: the text format reads directly
from a bufio.Reader (copy 1) stacked on the copy.Reader (no buffering),
which is stacked on the pgwire BufferedReader (copy 2). For CSV, the
CSVReader reads directly from the copy.Reader since it has its own
buffer, so again only two copies off the wire. Binary reads directly
from the copy.Reader since it requires no ReadLine (but the I/O is still
buffered at the pgwire level).
This doesn't seem to affect performance much, but it gives the GC a nice
break and sets up a clean solution for #99327.
When encountering a memory usage error, we used to try to let the
encoder finish the row, but with the more efficient buffering this
started succeeding where it had always failed before. Now we skip the
hail mary: if we hit the limit, we bail and return immediately, which is
more OOM safe and simpler.
Fixes: #93156
Informs: #99327
Release note: none
Epic: CRDB-25321
copy: enhance copyfrom roachtest
Add a 3 node config for more retry coverage and turn off backups since
it's a perf test.
Release note: none
Epic: CRDB-25321
Informs: #99327
copy: add copy carriage return test
Purely a test change to get 100% line coverage in readTextData.
Informs: #99327
Epic: None
Release note: None