ARROW-14577: [C++] Enable fine grained IO for async IPC reader #11616
Conversation
This is ready for review, but at the moment the performance is worse in the micro-benchmark. The micro-benchmark is dealing with a 1MB in-memory file, so my guess is that the extra read calls required for partial reads are hurting. I will try it again with larger files, with memory contention, and with disk I/O to see if that changes anything.
I fixed up the benchmarks somewhat and added a benchmark for cold I/O. Initial results are a bit noisy:
There is a substantial improvement in the partial-column read async case (ReadRealFileAsync/num_cols/1) at 4/16 columns. I'm not sure why that falls away as the number of columns increases, so I still need to investigate that.
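For context, a minimal sketch of the kind of partial-column read these benchmarks exercise, using IpcReadOptions::included_fields. The path and column indices are illustrative, and this uses the synchronous reader rather than the PR's benchmark harness; the async path honors the same option and, with this PR, should translate the selection into fine-grained reads of just the selected body buffers.

```cpp
// Sketch only: read two of the columns from an Arrow IPC file.
#include <arrow/io/file.h>
#include <arrow/ipc/options.h>
#include <arrow/ipc/reader.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>

arrow::Status ReadTwoColumns(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));

  auto options = arrow::ipc::IpcReadOptions::Defaults();
  options.included_fields = {0, 3};  // only deserialize these top-level fields

  ARROW_ASSIGN_OR_RAISE(
      auto reader, arrow::ipc::RecordBatchFileReader::Open(file, options));
  for (int i = 0; i < reader->num_record_batches(); ++i) {
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
    (void)batch;  // ... consume batch ...
  }
  return arrow::Status::OK();
}
```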
Out of curiosity, how does this affect S3 performance? (Or simulated S3 via Minio/toxiproxy, which would probably be less variable.)
Just some thoughts.
I wonder, depending on selectivity and batch size, how well this stacks up to just pre-caching the entire record batch (header + body), especially on S3, disregarding the column filter. Having to load record batch metadata interspersed across the entire file is a disadvantage compared to Parquet, which stores all of that in the footer (so we can coalesce reads of data across row groups), whereas here we're only coalescing within each record batch.
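For reference, a rough sketch of the within-batch coalescing being discussed, using arrow::io::ReadRangeCache; the offsets and lengths are invented and this is not the PR's actual plumbing.

```cpp
#include <arrow/buffer.h>
#include <arrow/io/caching.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Sketch: coalesce reads for a few selected column buffers within one record
// batch. Ranges separated by small holes get merged into a single larger read.
arrow::Status CacheBatchRanges(std::shared_ptr<arrow::io::RandomAccessFile> file) {
  arrow::io::ReadRangeCache cache(file, arrow::io::IOContext(),
                                  arrow::io::CacheOptions::Defaults());

  // Hypothetical body-buffer locations for the selected columns of one batch.
  std::vector<arrow::io::ReadRange> ranges = {
      {0, 4096}, {8192, 4096}, {1 << 20, 4096}};
  ARROW_RETURN_NOT_OK(cache.Cache(ranges));  // issues the (coalesced) reads
  ARROW_ASSIGN_OR_RAISE(auto buffer, cache.Read({8192, 4096}));  // from cache
  (void)buffer;
  return arrow::Status::OK();
}
```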
I dug into the performance a bit more for the small files case (I'll do S3 soon, but I think I want to do real S3 and not minio, since the former supports parallelism and the latter, attached to my HDD, does not).

Note: Asynchronous readers in these tests are not being consumed in parallel, so we wait until a batch is returned before reading the next batch. However, asynchronous readers still issue parallel reads and use threads. Reading a single batch that needs 8 columns will trigger 8 parallel reads.

Note: Even the synchronous reader will use parallel reads if only a subset of the columns are targeted. It will use the IoRecordedRandomAccessFile, which then uses the read range cache, which performs reads in parallel.

Hot In-Memory Memory Mapped (also, arrow::io::BufferReader): Asynchronous reads should never be used in this case. A "read" is just pointer arithmetic; there are no copies. I didn't benchmark this case.

Cold On-Disk Memory Mapped: I did not test this. I'm not sure if it is an interesting case or not.

Hot In-Memory Regular File: Cherry-picking some interesting cases (note the rate here is based on the total buffer size of the selected columns, so selecting fewer columns shouldn't yield a higher rate).
Conclusions: This change significantly improves performance of partial async reads, to the point where partial async reads on "well-formed" files (data >> metadata) are comparable to sync partial reads. Async full read is still considerably worse than sync full read, which is surprising but possibly due to threading overhead. This is worth investigating in a future PR.

Cold In-Memory Regular File
Conclusions: This change does improve performance of partial async reads. However, it seems to come at the cost of full async reads. David's suggestion of falling back to a full file read should alleviate this. In all cases the performance of partial reads deteriorates quickly; this is because we are essentially falling back to either "reading too much" (Old-Async) or random reads. The random read rate lines up with using

Remaining tasks
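As a side note, the trade-off between "reading too much" and many small random reads is governed by the read coalescing options. A hedged sketch of the knobs involved follows; the values are illustrative, not what the IPC reader actually uses.

```cpp
#include <arrow/io/caching.h>

// Sketch: the coalescing knobs behind the "read too much" vs. "random reads"
// trade-off. Small holes between ranges are read through and discarded; very
// large coalesced ranges are split so a single read doesn't balloon.
arrow::io::CacheOptions MakeCoalescingOptions() {
  auto options = arrow::io::CacheOptions::Defaults();
  options.hole_size_limit = 8 * 1024;           // merge ranges < 8 KiB apart
  options.range_size_limit = 32 * 1024 * 1024;  // cap a coalesced read at 32 MiB
  return options;
}
```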
Generally seems to be an improvement for S3 ("S3 Local" means the code was run in the S3 cloud, so the access was local; it seems the EC2 container's bandwidth was a significant limiting factor):
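For the curious, a minimal sketch of the S3 scenario above; the bucket, key, and column selection are placeholders and error handling is kept to a minimum.

```cpp
#include <arrow/filesystem/s3fs.h>
#include <arrow/ipc/options.h>
#include <arrow/ipc/reader.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Sketch: open an IPC file stored on S3 and read only a few of its columns.
arrow::Status ReadColumnsFromS3() {
  ARROW_RETURN_NOT_OK(arrow::fs::EnsureS3Initialized());
  ARROW_ASSIGN_OR_RAISE(
      auto fs, arrow::fs::S3FileSystem::Make(arrow::fs::S3Options::Defaults()));
  ARROW_ASSIGN_OR_RAISE(auto file,
                        fs->OpenInputFile("my-bucket/path/to/data.arrow"));

  auto options = arrow::ipc::IpcReadOptions::Defaults();
  options.included_fields = {0, 1, 2, 3};  // 4 of the columns

  ARROW_ASSIGN_OR_RAISE(
      auto reader, arrow::ipc::RecordBatchFileReader::Open(file, options));
  for (int i = 0; i < reader->num_record_batches(); ++i) {
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
    (void)batch;  // ... consume batch ...
  }
  return arrow::Status::OK();
}
```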
Fair question. I've been wondering if we might want to someday investigate a variant of the Arrow IPC file format that stores batch lengths in the footer. The schema itself is there, so if we had the batch lengths (which should only be 8 bytes * num_batches) then we could:

A. Have O(1) random access to an individual row

That being said, this seems less important when files get large enough. Even on S3 the metadata fetch is only a very small fraction of the total access time. I don't plan on investigating that as part of this PR.
I split the benchmarks into a separate PR (#12150). I did a bit more analysis today. There is a substantial performance loss in a few situations:

Async full file reads (e.g. when reading all columns): David's suggestion will probably work here, but it's going to be a bit tricky to implement. Right now each time we load a shared record batch we are using a dedicated read cache, so there is no single read cache we can tell to cache the whole file. I plan to look at this more tomorrow.

Reading from a buffer (e.g. when doing no I/O at all). For example:

Old:
New:
In this particular case we are over 2x slower. Some of this slowdown is because the read range cache is calling file->WillNeed on these regions, which triggers an madvise (which seems to mainly eat up time purely by virtue of being a system call). Removing that call gets us to

I'm pretty sure the rest of the time is lost because we are using more futures, which means more allocation and shared_ptr overhead. There is no quick fix for that, but I am thinking I want to tackle Future improvements in 8.0.0. There's a Windows build error I will fix.

At the moment I am leaning towards including this but am somewhat split. The slowdowns are on an already lightning-fast path (e.g. we are going from 4000ns to 8000ns for a zero-copy buffer read) for an operation we aren't yet calling in any real critical section (these calls are per-batch). The speedup is on a very slow path (e.g. going from 7.8 seconds to 1.7 seconds on a 1G file read because we're reading 8 columns instead of 64 columns) but maybe not as common of one for IPC.
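To make the WillNeed/madvise point concrete, here is roughly what that call looks like in isolation on a memory-mapped file (the path and range are made up); on a hot in-memory buffer the advice buys nothing and only costs the system call.

```cpp
#include <arrow/io/file.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Sketch: WillNeed() on a memory-mapped file, which boils down to an
// madvise(MADV_WILLNEED) hint for the given byte ranges.
arrow::Status AdviseRange(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(
      auto mmap_file,
      arrow::io::MemoryMappedFile::Open(path, arrow::io::FileMode::READ));
  std::vector<arrow::io::ReadRange> ranges = {{0, 1 << 20}};  // first 1 MiB
  return mmap_file->WillNeed(ranges);
}
```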
Right, I'm not sure the slowdown on an in-memory buffer is too significant (cc @pitrou for opinions too), though it seems we should find some way for the cache to avoid madvise/WillNeed in this case (since it's purely wasted work, right?). But it seems reading all columns would be a common operation and it would be nice to avoid the slowdown there.
Ah, so we fall back to the current implementation if the file is zero-copy (i.e. a buffer) or we're reading all columns? That sounds reasonable.
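Rough pseudocode of the fallback being described (not the PR's actual code; the function name is invented): fine-grained reads are only worthwhile when the source does real I/O and the projection actually skips columns.

```cpp
#include <arrow/io/interfaces.h>
#include <arrow/ipc/options.h>

// Sketch of the fallback decision: prebuffer/read the whole file unless
// per-column reads can actually save I/O.
bool ShouldUseFineGrainedReads(const arrow::io::RandomAccessFile& file,
                               const arrow::ipc::IpcReadOptions& options,
                               int num_fields) {
  // Zero-copy sources (e.g. arrow::io::BufferReader): a "read" is just pointer
  // arithmetic, so the extra bookkeeping of partial reads is pure overhead.
  if (file.supports_zero_copy()) return false;
  // No projection, or every field selected: there is nothing to skip.
  if (options.included_fields.empty() ||
      static_cast<int>(options.included_fields.size()) >= num_fields) {
    return false;
  }
  return true;
}
```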
Yes, that seems to remove the major slowdowns. There is still about a 15-20% slowdown on some of the async pure-buffer operations because we use a bit more Future but this seems acceptable to me now.
Yes, sounds about right.
LGTM, though we should address the FIXME if applicable.
I'll merge on green. Thanks @lidavidm for all the reviews and help on a Friday!
Benchmark runs are scheduled for baseline = f585a47 and contender = 7029f90. 7029f90 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
In the end this PR only addresses projection pushdown on the asynchronous path. The approach laid out here could be used on the synchronous path to offer pre-buffering. Tests should be done to see if that is faster.
The memory-mapped / WillNeed issue should be solved by making auto-will-need a property of the memory-mapped file and will not be addressed here.