
Improve efficiency of buffered_reader. #21256

Merged · 3 commits · Sep 24, 2024

Conversation

@LucasSantos91 (Contributor)

The current implementation of buffered_reader always reads from the unbuffered reader into the internal buffer, and then dumps the data onto the destination. This is inefficient, as sometimes it's possible to read directly into the destination. The current strategy generates more memory copies and unbuffered reads than necessary. This PR implements a 3-step algorithm in buffered_reader to resolve these problems.

@squeek502 (Collaborator)

squeek502 commented Aug 30, 2024

Am I wrong in thinking this approach would severely regress the performance/invalidate the "buffered" part of BufferedReader with an underlying reader that reads one byte at a time (i.e. the OneByteReadReader in the OneByte test)? Nevermind, I was misreading the while (remaining_dest.len > buffer_size) condition; I didn't realize the dest only gets written to directly on larger-than-buffer-size reads.

Anyway, would be good to get some benchmarks/strace logs with different read sizes.

@LucasSantos91 (Contributor, Author)

> Anyway, would be good to get some benchmarks/strace logs with different read sizes.

Cool, thanks for the suggestion. Traces were obtained with Microsoft's Process Monitor, with the following filters:

  • Process Name is my application
  • Event class is file system
  • Category is read

My application is a medical image viewer which, conveniently, uses buffered_reader in exactly one line of code. It performs many small reads when parsing the headers, followed by very large reads when parsing the images themselves. It was compiled with ReleaseFast. The code is deterministic, and I confirmed that the trace count varied by only about 4.2e-5% between runs. I tested with the default buffer size of 4096 and a smaller buffer size of 512, always loading the exact same images. I did one run before the tests to try to cache the files, though I don't think this changed anything. Here are the results:

Code                      Trace count
mine - large buffer       4,671,241
original - large buffer   5,145,833
mine - small buffer       4,728,204
original - small buffer   8,586,554

For the large buffer, my code reduced the number of reads by 9.2%; for the small buffer, by 44.9%.

@daurnimator (Contributor)

FYI, buffered reader used to be implemented in terms of LinearFifo, which did a similar thing. It was removed via #14029.

@matklad (Contributor)

matklad commented Sep 2, 2024

Drive-by thoughts:

  • Yes, absolutely, buffered reader should pass through large reads to the underlying reader!
  • If/when we make buffered reader dynamically dispatched (my pet peeve), "pass-through-enabled read" should be a part of the vtable.
  • At the same time, I think this particular approach does too much and can be simplified.
  • The main thing is that this PR contains a "read loop", where we repeatedly read until we fill the entire buffer. But this is a read function, whose interface allows for partial reads already, so there's already a similar loop at the call site.
  • I think the implementation can be simpler if we restrict ourselves to doing at most one call to the underlying reader per our read call, and leverage that caller-supplied outer loop for the rest.
  • I think that's what Rust is doing, and it looks dramatically simpler.
  • I think their algorithm boils down to this (see the sketch after this list):
    • If we have a non-empty internal buffer, dump (part of) it to the destination and return. No calls to the underlying reader are made.
    • Otherwise, if our buffer is empty:
      • If the destination is larger than our internal buffer, issue one pass-through read directly into the destination.
      • Otherwise (the destination is smaller than the internal buffer), issue one read to fill our buffer, then copy part of it to the destination.
  • So, assuming that the caller of the buffered reader has a loop, what would happen here naturally is that:
    • The first read would fully drain our buffer via one memcpy.
    • Then, we'd do a bunch of pass-through reads.
    • Finally, on the last iteration, when the caller's buffer becomes short, we refill our internal buffer again.
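A minimal sketch of that simplified algorithm in Zig. The type and field names (SketchBufferedReader, buf, start, end, unbuffered_reader) are hypothetical stand-ins, not the actual std.io.BufferedReader internals or the code merged in this PR:

```zig
// A minimal sketch of the simplified algorithm. Names are hypothetical;
// this is not the actual std.io.BufferedReader implementation.
fn SketchBufferedReader(comptime ReaderType: type) type {
    return struct {
        unbuffered_reader: ReaderType,
        buf: [4096]u8 = undefined,
        start: usize = 0, // index of the first unread byte in buf
        end: usize = 0, // one past the last valid byte in buf

        const Self = @This();
        pub const Error = ReaderType.Error;

        pub fn read(self: *Self, dest: []u8) Error!usize {
            // Non-empty internal buffer: serve (part of) it and return.
            // No call to the underlying reader is made.
            if (self.end > self.start) {
                const n = @min(dest.len, self.end - self.start);
                @memcpy(dest[0..n], self.buf[self.start..][0..n]);
                self.start += n;
                return n;
            }
            // Buffer is empty and the destination is at least as large
            // as the buffer: one pass-through read, straight into dest.
            if (dest.len >= self.buf.len) {
                return self.unbuffered_reader.read(dest);
            }
            // Buffer is empty and the destination is small: refill the
            // buffer with one read, then copy part of it to dest.
            self.start = 0;
            self.end = try self.unbuffered_reader.read(self.buf[0..]);
            const n = @min(dest.len, self.end);
            @memcpy(dest[0..n], self.buf[0..n]);
            self.start = n;
            return n;
        }
    };
}
```

Note that each call makes at most one call to the underlying reader, deferring the rest of the work to the caller's own read loop.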

@LucasSantos91 (Contributor, Author)

LucasSantos91 commented Sep 2, 2024

> I think the implementation can be simpler if we restrict ourselves to doing at most one call to the underlying reader per our read call, and leverage that caller-supplied outer loop for the rest.

I considered this. The problem is that each subsequent read would have to spend time checking where, in the 3 steps, it currently is. For instance, consider a buffer size of 512, with 10 bytes already buffered, and a user request for 600 bytes.
First, we would dump the bytes we have and return.
On the next call, we would check whether we have enough buffered bytes to fulfill the request, which we already know we don't. We would then check whether we need a pass-through read or a read into our buffer; in this case, a pass-through. Suppose the unbuffered reader reads 100 bytes.
On the next call, we would again check whether we have enough buffered bytes (we know we don't) and whether to do a pass-through or a read into our buffer, and we already know we have to read into our buffer.
All this back and forth seemed wasteful to me, which is why I went with an algorithm that does the 3 steps at once (sketched below): either we deliver something buffered and are done, or we go all the way and get something into our buffer to be delivered next time.
Anyway, since I have real code that depends on this, I can try both versions and see if there is a performance difference.
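A rough sketch of that all-at-once 3-step read, written as a replacement read method for the same hypothetical struct as the earlier sketch. Again illustrative only, reconstructed from the description above rather than taken from the PR:

```zig
// Rough sketch of the all-at-once 3-step algorithm; hypothetical names,
// not the code that was ultimately merged.
pub fn read(self: *Self, dest: []u8) Error!usize {
    var dest_index: usize = 0;

    // Step 1: dump whatever is already buffered into dest.
    const to_copy = @min(dest.len, self.end - self.start);
    @memcpy(dest[0..to_copy], self.buf[self.start..][0..to_copy]);
    self.start += to_copy;
    dest_index += to_copy;

    // Step 2: while the remaining destination is larger than the
    // internal buffer, read straight into it (pass-through reads).
    while (dest.len - dest_index > self.buf.len) {
        const n = try self.unbuffered_reader.read(dest[dest_index..]);
        if (n == 0) return dest_index; // end of stream
        dest_index += n;
    }

    // Step 3: refill the internal buffer once and serve the tail from
    // it, leaving any leftover bytes buffered for the next call.
    if (dest_index < dest.len) {
        self.start = 0;
        self.end = try self.unbuffered_reader.read(self.buf[0..]);
        const n = @min(dest.len - dest_index, self.end);
        @memcpy(dest[dest_index..][0..n], self.buf[0..n]);
        self.start = n;
        dest_index += n;
    }
    return dest_index;
}
```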

@LucasSantos91 (Contributor, Author)

I've implemented @matklad's suggestion. The code became beautifully small. My benchmarks with my own program showed no difference in performance in ReleaseFast.

@LucasSantos91 (Contributor, Author)

CI was failing because the tests for the BufferedReader wrongly expected read to behave like readAll; that is, they didn't expect partial reads.
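For context: read may legally return fewer bytes than requested, while readAll loops until the buffer is full or the stream ends. A small illustrative usage, assuming reader is any std.io reader:

```zig
// Illustrative only: `reader` stands for any std.io reader.
fn example(reader: anytype) !void {
    var buf: [100]u8 = undefined;
    // read may return fewer than buf.len bytes; a partial read is valid.
    const n = try reader.read(&buf);
    // readAll loops internally until buf is full or the stream ends.
    const total = try reader.readAll(&buf);
    _ = n;
    _ = total;
}
```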

@andrewrk merged commit b19d0fb into ziglang:master on Sep 24, 2024. 10 checks passed.
@andrewrk added the labels: enhancement, optimization, standard library, release notes.
@andrewrk (Member)

Thank you for the collaboration.

@LucasSantos91 could you share your testing methodology as well as your latest performance data points and give me permission to copy that into the future release notes?

@LucasSantos91 (Contributor, Author)

I tested it with my own program, using medical images from real patients; sadly, I can't share them.
I just timed the parsing of the images, which includes the header (lots of small reads) and the image itself (one large read). There were 910 files, totalling 500 MB of data. I did one warmup run before the benchmark to cache the files.
The old code took 231.3 ms to complete (standard deviation 13.0 ms), while the new code took 218.3 ms (standard deviation 12.0 ms), a 5.6% reduction in run time.

DivergentClouds pushed a commit to DivergentClouds/zig that referenced this pull request on Sep 24, 2024.
richerfu pushed a commit to richerfu/zig that referenced this pull request on Oct 28, 2024.