
Improve efficiency of buffered_reader. #21256

Merged · 3 commits · Sep 24, 2024

Conversation

@LucasSantos91 (Contributor)

The current implementation of buffered_reader always reads from the unbuffered reader into the internal buffer, and then dumps the data onto the destination. This is inefficient, as sometimes it's possible to read directly into the destination. The current strategy generates more memory copies and unbuffered reads than necessary. This PR implements a 3-step algorithm in buffered_reader to resolve these problems.

@squeek502 (Collaborator)

squeek502 commented Aug 30, 2024

Am I wrong in thinking this approach would severely regress the performance/invalidate the "buffered" part of BufferedReader with an underlying reader that reads one byte at a time (i.e. the OneByteReadReader in the OneByte test)? Nevermind, I was misreading the while (remaining_dest.len > buffer_size) condition; I didn't realize the dest only gets written to directly on larger-than-buffer-size reads.

Anyway, would be good to get some benchmarks/strace logs with different read sizes.

@LucasSantos91 (Contributor, Author)

> Anyway, would be good to get some benchmarks/strace logs with different read sizes.

Cool, thanks for the suggestion. Traces were obtained with Microsoft's Process Monitor, with the following filters:

  • Process Name is my application
  • Event class is file system
  • Category is read

My application is a medical image viewer which, conveniently, uses buffered_reader in exactly one line of code. It performs many small reads when parsing the headers, followed by very large reads when parsing the images themselves. It was compiled with ReleaseFast. The code is deterministic, and I confirmed that the trace count varied by only about 4.2e-5% between runs. I tested with the default buffer size of 4096 and a smaller buffer size of 512, always loading the exact same images. I did one run before the tests to try to cache the files, though I don't think this changed anything. Here are the results:

Code                      Trace count
mine - large buffer       4,671,241
original - large buffer   5,145,833
mine - small buffer       4,728,204
original - small buffer   8,586,554

For the large buffer, my code reduced the number of reads by 9.2%; for the small buffer, by 44.9%.

@daurnimator (Contributor)

FYI, buffered reader used to be implemented in terms of LinearFifo, which did a similar thing. It was removed via #14029.

@matklad (Contributor)

matklad commented Sep 2, 2024

Drive-by thoughts:

  • Yes, absolutely, buffered reader should pass through large reads to the underlying reader!
  • If/when we make buffered reader dynamically dispatched (my pet peeve), "pass-through-enabled read" should be a part of the vtable.
  • At the same time, I think this particular approach does too much and can be simplified.
  • The main thing is that this PR contains a "read loop", where we repeatedly read until we fill the entire buffer. But this is a read function, whose interface allows for partial reads already, so there's already a similar loop at the call site.
  • I think the implementation can be simpler if we restrict ourselves to doing at most one call to the underlying reader per our read call, and leverage that caller-supplied outer loop for the rest.
  • I think that's what Rust is doing, and it looks dramatically simpler.
  • I think their algorithm boils down to this (see the sketch after this list):
    • If we have a non-empty internal buffer, dump (part of) it to the destination and return. No calls to the underlying reader are made.
    • Otherwise, if our buffer is empty:
      • If the destination is larger than our internal buffer, issue one pass-through read directly into the destination.
      • Otherwise (the destination is smaller than the internal buffer), issue one read to fill our buffer, then copy part of it to the destination.
  • So, assuming that the caller of the buffered reader has a loop, what would happen here naturally is that:
    • The first read would fully drain our buffer via one memcpy.
    • Then, we'd do a bunch of pass-through reads.
    • Finally, on the last iteration, when the caller's buffer becomes short, we refill our internal buffer again.
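A minimal sketch of that simplified algorithm in Zig. The type and field names (SketchBufferedReader, buf, start, end, unbuffered_reader) are hypothetical stand-ins, not the actual std.io.BufferedReader internals or the code merged in this PR:

```zig
// A minimal sketch of the simplified algorithm. Names are hypothetical;
// this is not the actual std.io.BufferedReader implementation.
fn SketchBufferedReader(comptime ReaderType: type) type {
    return struct {
        unbuffered_reader: ReaderType,
        buf: [4096]u8 = undefined,
        start: usize = 0, // index of the first unread byte in buf
        end: usize = 0, // one past the last valid byte in buf

        const Self = @This();
        pub const Error = ReaderType.Error;

        pub fn read(self: *Self, dest: []u8) Error!usize {
            // Non-empty internal buffer: serve (part of) it and return.
            // No call to the underlying reader is made.
            if (self.end > self.start) {
                const n = @min(dest.len, self.end - self.start);
                @memcpy(dest[0..n], self.buf[self.start..][0..n]);
                self.start += n;
                return n;
            }
            // Buffer is empty and the destination is at least as large
            // as the buffer: one pass-through read, straight into dest.
            if (dest.len >= self.buf.len) {
                return self.unbuffered_reader.read(dest);
            }
            // Buffer is empty and the destination is small: refill the
            // buffer with one read, then copy part of it to dest.
            self.start = 0;
            self.end = try self.unbuffered_reader.read(self.buf[0..]);
            const n = @min(dest.len, self.end);
            @memcpy(dest[0..n], self.buf[0..n]);
            self.start = n;
            return n;
        }
    };
}
```

Note that each call makes at most one call to the underlying reader, deferring the rest of the work to the caller's own read loop.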

@LucasSantos91 (Contributor, Author)

LucasSantos91 commented Sep 2, 2024

> I think the implementation can be simpler if we restrict ourselves to doing at most one call to the underlying reader per our read call, and leverage that caller-supplied outer loop for the rest.

I considered this. The problem is that each subsequent read would have to spend time checking where, in the 3 steps, it currently is. For instance, consider a buffer size of 512, with 10 bytes already buffered, and a user request for 600 bytes.
First, we would dump the bytes we have and return.
On the next call, we would check whether we have enough buffered bytes to fulfill the request, which we already know we don't. We would then check whether we need a pass-through read or a read into our buffer; in this case, a pass-through. Suppose the unbuffered reader reads 100 bytes.
On the next call, we would again check whether we have enough buffered bytes (we know we don't) and whether to do a pass-through or a read into our buffer, and we already know we have to read into our buffer.
All this back and forth seemed wasteful to me, which is why I went with an algorithm that does the 3 steps at once (sketched below): either we deliver something buffered and are done, or we go all the way and get something into our buffer to be delivered next time.
Anyway, since I have real code that depends on this, I can try both versions and see if there is a performance difference.
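A rough sketch of that all-at-once 3-step read, written as a replacement read method for the same hypothetical struct as the earlier sketch. Again illustrative only, reconstructed from the description above rather than taken from the PR:

```zig
// Rough sketch of the all-at-once 3-step algorithm; hypothetical names,
// not the code that was ultimately merged.
pub fn read(self: *Self, dest: []u8) Error!usize {
    var dest_index: usize = 0;

    // Step 1: dump whatever is already buffered into dest.
    const to_copy = @min(dest.len, self.end - self.start);
    @memcpy(dest[0..to_copy], self.buf[self.start..][0..to_copy]);
    self.start += to_copy;
    dest_index += to_copy;

    // Step 2: while the remaining destination is larger than the
    // internal buffer, read straight into it (pass-through reads).
    while (dest.len - dest_index > self.buf.len) {
        const n = try self.unbuffered_reader.read(dest[dest_index..]);
        if (n == 0) return dest_index; // end of stream
        dest_index += n;
    }

    // Step 3: refill the internal buffer once and serve the tail from
    // it, leaving any leftover bytes buffered for the next call.
    if (dest_index < dest.len) {
        self.start = 0;
        self.end = try self.unbuffered_reader.read(self.buf[0..]);
        const n = @min(dest.len - dest_index, self.end);
        @memcpy(dest[dest_index..][0..n], self.buf[0..n]);
        self.start = n;
        dest_index += n;
    }
    return dest_index;
}
```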

@LucasSantos91 (Contributor, Author)

I've implemented @matklad's suggestion. The code became beautifully small. My benchmarks with my own program showed no difference in performance in ReleaseFast.

@LucasSantos91 (Contributor, Author)

CI was failing because the tests for the BufferedReader wrongly expected read to behave like readAll; that is, they didn't expect partial reads.
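For context: read may legally return fewer bytes than requested, while readAll loops until the buffer is full or the stream ends. A small illustrative usage, assuming reader is any std.io reader:

```zig
// Illustrative only: `reader` stands for any std.io reader.
fn example(reader: anytype) !void {
    var buf: [100]u8 = undefined;
    // read may return fewer than buf.len bytes; a partial read is valid.
    const n = try reader.read(&buf);
    // readAll loops internally until buf is full or the stream ends.
    const total = try reader.readAll(&buf);
    _ = n;
    _ = total;
}
```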

@andrewrk merged commit b19d0fb into ziglang:master on Sep 24, 2024. 10 checks passed.
@andrewrk added the labels: enhancement, optimization, standard library, release notes.
@andrewrk (Member)

Thank you for the collaboration.

@LucasSantos91 could you share your testing methodology as well as your latest performance data points and give me permission to copy that into the future release notes?

@LucasSantos91 (Contributor, Author)

I tested it with my own program, using medical images from real patients; sadly, I can't share them.
I just timed the parsing of the images, which includes the header (lots of small reads) and the image itself (one large read). There were 910 files, totalling 500 MB of data. I did one warmup run before the benchmark to cache the files.
The old code took 231.3 ms to complete (standard deviation 13.0 ms), while the new code took 218.3 ms (standard deviation 12.0 ms), a 5.6% reduction in run time.

DivergentClouds pushed a commit to DivergentClouds/zig that referenced this pull request on Sep 24, 2024.
richerfu pushed a commit to richerfu/zig that referenced this pull request on Oct 28, 2024.