
ARROW-14577: [C++] Enable fine grained IO for async IPC reader #11616

Conversation

@westonpace (Member) commented Nov 5, 2021:

In the end this PR only addresses projection pushdown on the asynchronous path. The approach laid out here could also be used on the synchronous path to offer pre-buffering; tests should be done to see whether that is faster.

The memory-mapped / will-need issue should be solved by making auto-will-need a property of the memory-mapped file; it will not be addressed here.
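(For illustration, not part of the PR text: a minimal sketch of the column projection this feature targets, using the existing `arrow::ipc::IpcReadOptions::included_fields` option. The file path and field indices are placeholders; with fine-grained IO the reader can limit its reads to the buffers of the selected columns rather than fetching each whole record batch.)

```cpp
#include <arrow/io/file.h>
#include <arrow/ipc/options.h>
#include <arrow/ipc/reader.h>
#include <arrow/result.h>
#include <arrow/status.h>

arrow::Status ReadSelectedColumns() {
  // Path is a placeholder for any seekable IPC file.
  ARROW_ASSIGN_OR_RAISE(auto file,
                        arrow::io::ReadableFile::Open("/tmp/example.arrow"));

  // Project only top-level fields 0 and 3; with fine-grained IO the reader
  // only needs to fetch the buffers backing these columns.
  auto options = arrow::ipc::IpcReadOptions::Defaults();
  options.included_fields = {0, 3};

  ARROW_ASSIGN_OR_RAISE(
      auto reader, arrow::ipc::RecordBatchFileReader::Open(file, options));
  for (int i = 0; i < reader->num_record_batches(); ++i) {
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
    // ... consume batch ...
  }
  return arrow::Status::OK();
}
```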

@github-actions bot commented Nov 5, 2021:

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@westonpace westonpace force-pushed the feature/ARROW-14577--simplify-ipc-reader-add-cache branch from 834957d to d699877 Compare November 18, 2021 11:59
@westonpace westonpace force-pushed the feature/ARROW-14577--simplify-ipc-reader-add-cache branch 2 times, most recently from 619532e to a516bfc Compare December 3, 2021 23:17
@westonpace westonpace marked this pull request as ready for review December 4, 2021 05:17
@westonpace (Member Author):

This is ready for review, but at the moment the performance is worse in the micro-benchmark. The micro-benchmark deals with a 1MB in-memory file, so my guess is that the extra read calls required for partial reads are what hurts. I will try it again with larger files, with memory contention, and with disk I/O to see if that changes anything.

@westonpace (Member Author):

I fixed up the benchmarks somewhat and added a benchmark for cold I/O. Initial results are a bit noisy (columns: relative time change, relative CPU change, baseline time, contender time, baseline CPU, contender CPU):

ReadRealFile/1/0/real_time_mean                          +0.1096         +0.0203     882364277     979101124     141562817     144432381
ReadRealFile/4/0/real_time_mean                          +0.1379         -0.0096    1014012277    1153867283     141160362     139811386
ReadRealFile/16/0/real_time_mean                         -0.0303         +0.0033     900165884     872914568     143141575     143615607
ReadRealFile/64/0/real_time_mean                         -0.1049         +0.0007    1154140298    1033116675     140212506     140312232
ReadRealFile/256/0/real_time_mean                        +0.0990         +0.0239    1036406449    1138971720     149655837     153234293
ReadRealFile/1024/0/real_time_mean                       +0.0572         +0.0509     841348002     889506548     160709304     168895957
ReadRealFile/4096/0/real_time_mean                       -0.0244         +0.0147    1040310983    1014902847     269116155     273081737
ReadRealFile/8192/0/real_time_mean                       +0.1152         +0.0127    1073965563    1197736789     388846023     393793665
ReadRealFile/1/1/real_time_mean                          -0.0211         -0.0686     852270837     834276010     146015520     136002655
ReadRealFile/4/1/real_time_mean                          -0.0477         -0.0025     444691103     423473671      53215988      53082698
ReadRealFile/16/1/real_time_mean                         +1.5411         -0.1796     172459194     438231423      29137200      23905451
ReadRealFile/64/1/real_time_mean                         -0.1627         -0.0478    1004655612     841218794      35870849      34156124
ReadRealFile/256/1/real_time_mean                        +0.0982         -0.0181     897612093     985737042      63828576      62673771
ReadRealFile/1024/1/real_time_mean                       +0.2167         +0.5705    1020315892    1241371726      84622012     132899024
ReadRealFile/4096/1/real_time_mean                       -0.2797         -0.1929    1209181656     871000618      71425702      57648840
ReadRealFile/8192/1/real_time_mean                       +0.2138         +0.3822     938647868    1139351095     159262940     220134001
ReadRealFileAsync/1/0/real_time_mean                     +0.3462         +0.3408     906170613    1219925626       2299242       3082930
ReadRealFileAsync/4/0/real_time_mean                     -0.1024         +0.3570    1124871244    1009668198       2312991       3138757
ReadRealFileAsync/16/0/real_time_mean                    +0.2950         +0.5326     846347457    1096059130       2476560       3795647
ReadRealFileAsync/64/0/real_time_mean                    +0.0606         +0.8470    1225185369    1299444655       2964952       5476138
ReadRealFileAsync/256/0/real_time_mean                   -0.0583         +1.6426    1091256184    1027624433       4635198      12249101
ReadRealFileAsync/1024/0/real_time_mean                  +0.2313         +2.2223     925963093    1140141694      11659770      37571048
ReadRealFileAsync/4096/0/real_time_mean                  +0.2538         +1.0950     981474329    1230572649      46248110      96890904
ReadRealFileAsync/8192/0/real_time_mean                  -0.2438         +1.0463    1352917947    1023135386      86818642     177659883
ReadRealFileAsync/1/1/real_time_mean                     +0.2791         +0.7906     856795160    1095885713       2013025       3604501
ReadRealFileAsync/4/1/real_time_mean                     -0.6628         +7.4180    1022300282     344763011       2321308      19540718
ReadRealFileAsync/16/1/real_time_mean                    -0.8603         +5.4649    1240246020     173301490       2479652      16030715
ReadRealFileAsync/64/1/real_time_mean                    -0.1437         +6.2582     852187403     729739446       2156059      15649018
ReadRealFileAsync/256/1/real_time_mean                   +0.1711         +6.8878    1059134316    1240323095       2993556      23612644
ReadRealFileAsync/1024/1/real_time_mean                  +0.1014         +8.6551    1091676930    1202395405       4424422      42718098
ReadRealFileAsync/4096/1/real_time_mean                  +0.5519         +6.7698    1042589081    1618010814       8712906      67697324
ReadRealFileAsync/8192/1/real_time_mean                  -0.2005         +4.6431    1307820760    1045555603      17304062      97648656

There is a substantial improvement in the partial-column async read case (ReadRealFileAsync/num_cols/1) at 4/16 columns. I'm not sure why that falls away as the number of columns increases, so I still need to investigate that.

@westonpace (Member Author):

(fixed charts). There is some benefit at the macro-level. These are hot I/O reads which explains why we are able to see some parallelism:

[charts: async, old, sync]

@westonpace (Member Author):

Similar results with cold I/O:

[charts: async-cold, old-cold, sync-cold]

@lidavidm (Member) commented Dec 8, 2021:

Out of curiosity, how does this affect S3 performance? (Or simulated S3 via Minio/toxiproxy which would probably be less variable.)

@lidavidm (Member) left a review:

Just some thoughts.

I wonder, depending on selectivity and batch size, how well this stacks up to just pre-caching the entire record batch (header + body) especially on S3, disregarding the column filter. Having to load the record batch metadata interspersed across the entire file is a disadvantage compared to Parquet, which stores all that in the footer (so we can coalesce reads of data across row groups) whereas here we're only coalescing within each record batch.

Inline review comments (resolved): cpp/src/arrow/ipc/read_write_benchmark.cc, cpp/src/arrow/ipc/read_write_test.cc, cpp/src/arrow/ipc/reader.cc
@westonpace (Member Author) commented Dec 16, 2021:

I dug into the performance a bit more for the small files case (I'll do S3 soon but I think I want to do real S3 and not minio since the former supports parallelism and the latter, attached to my HDD, does not).

Note: Asynchronous readers in these tests are not being consumed in parallel, so we wait until a batch is returned before reading the next batch. However, asynchronous readers still issue parallel reads and use threads; reading a single batch that needs 8 columns will trigger 8 parallel reads.

Note: Even the synchronous reader will use parallel reads if only a subset of the columns is targeted. It uses the IoRecordedRandomAccessFile, which in turn uses the read range cache, which performs reads in parallel.
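(For context, a rough sketch of the read-range-cache pattern referred to above, using `arrow::io::ReadRangeCache` from `arrow/io/caching.h`. The offsets and lengths are placeholders standing in for column buffer ranges; this is not the reader's actual call site.)

```cpp
#include <memory>
#include <vector>

#include <arrow/buffer.h>
#include <arrow/io/caching.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Hint the ranges we will need (e.g. the buffers of the selected columns),
// let the cache coalesce nearby ranges into fewer, larger reads issued in
// parallel, then serve individual reads out of the cached buffers.
arrow::Status CoalescedRead(std::shared_ptr<arrow::io::RandomAccessFile> file) {
  arrow::io::ReadRangeCache cache(file, arrow::io::default_io_context(),
                                  arrow::io::CacheOptions::Defaults());

  // Placeholder ranges standing in for the selected columns' buffers.
  std::vector<arrow::io::ReadRange> ranges = {{0, 4096}, {8192, 4096}};
  ARROW_RETURN_NOT_OK(cache.Cache(ranges));

  // Each Read() is satisfied from the coalesced, already-fetched buffers.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> buf, cache.Read(ranges[0]));
  (void)buf;
  return arrow::Status::OK();
}
```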

Hot In-Memory Memory Mapped (also, arrow::io::BufferReader)

Asynchronous reads should never be used in this case. A "read" is just pointer arithmetic. There are no copies. I didn't benchmark this case.

Cold On-Disk Memory Mapped

I did not test this. I'm not sure if it is an interesting case or not.

Hot In-Memory Regular File

Cherry picking some interesting cases (note the rate here is based on the total buffer size of the selected columns. So selecting fewer columns shouldn't yield a higher rate).

| Sync/Async | # of columns | # of columns selected | Rate (Bps) | Note |
|---|---|---|---|---|
| Sync | 16 | 16 | 9.79967G/s | Seems a limit on my machine for 1-thread DRAM bandwidth |
| Sync | 16 | 2 | 12.8979G/s | Parallel reads increase DRAM bandwidth |
| Sync | 256 | 256 | 8.73684G/s | Starting to hit CPU bottleneck from excess metadata |
| Sync | 256 | 32 | 7.28792G/s | Since we are throttled on metadata / CPU, perf gets worse |
| Async | 16 | 16 | 2.58248G/s | Async is quite a bit worse than baseline for full reads |
| Async | 16 | 2 | 13.9343G/s | Async perf is similar on partial reads |
| Async | 256 | 256 | 2.4068G/s | |
| Async | 256 | 32 | 6.8774G/s | |
| Old-Async | 16 | 16 | 2.84301G/s | Old implementation has slightly lower overhead I think |
| Old-Async | 16 | 2 | 556.501M/s | Old implementation does not handle partial reads well |
| Old-Async | 256 | 256 | 2.78802G/s | |
| Old-Async | 256 | 32 | 459.484M/s | |

Conclusions: This change significantly improves performance of partial async reads, to the point where partial async reads on "well-formed" files (data >> metadata) are comparable to sync partial reads.

Async full read is still considerably worse than sync full read, which is surprising but possibly due to threading overhead. This is worth investigating in a future PR.

Cold On-Disk Regular File

| Sync/Async | # of columns | # of columns selected | Rate (Bps) | Note |
|---|---|---|---|---|
| Sync | 16 | 16 | 111.044M/s | Baseline, HDD throughput |
| Sync | 16 | 2 | 25.205M/s | Surprising, more below |
| Sync | 256 | 256 | 99.8336M/s | |
| Sync | 256 | 32 | 15.2597M/s | Surprising |
| Async | 16 | 16 | 98.5425M/s | Similar to sync, within noise but did consistently seem a bit lower |
| Async | 16 | 2 | 54.1136M/s | |
| Async | 256 | 256 | 96.5957M/s | |
| Async | 256 | 32 | 11.911M/s | Within noise of sync result actually, seems to bottom out around a noisy 10-16 |
| Old-Async | 16 | 16 | 138.266M/s | Not just noise, old async real-file is consistently better than sync |
| Old-Async | 16 | 2 | 17.4384M/s | |
| Old-Async | 256 | 256 | 123.721M/s | |
| Old-Async | 256 | 32 | 16.4605M/s | |

Conclusions: This change does improve performance of partial async reads. However, it seems to come at the cost of full async reads. David's suggestion of falling back to a full-file read should alleviate this.

In all cases the performance of partial reads deteriorates quickly. This is because we are essentially falling back to either "reading too much" (Old-Async) or random reads. The random read rate lines up with using fio to benchmark my disk. At 16 batches the data blocks are 520KB, and fio reports ~45MBps for random reads at that block size. At 256 batches the data blocks are 32KB, and fio reports ~4MBps (either fio is too pessimistic or we are able to take advantage of the pseudo-sequential nature of the reads).

Remaining tasks

  • Add fallback to full-file read for async
  • Investigate S3
  • Investigate multi-threaded local reads (both multiple files and consuming in parallel)
  • Recommend that users structure record batches so that each column contains at least 4MB of data if they plan to read from disk (a sizing sketch follows below).
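(Purely illustrative of the last point above: one way a writer could target roughly 4 MB per fixed-width column by picking the rows per record batch. `MakeFileWriter` and `WriteTable(table, max_chunksize)` are existing Arrow APIs; the 8-byte value width is an assumption about the schema.)

```cpp
#include <memory>

#include <arrow/io/interfaces.h>
#include <arrow/ipc/writer.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>

// Pick the rows per record batch so that each fixed-width column holds
// roughly kTargetBytes of data, per the recommendation above.
arrow::Status WriteWithLargeBatches(const std::shared_ptr<arrow::Table>& table,
                                    std::shared_ptr<arrow::io::OutputStream> sink) {
  constexpr int64_t kTargetBytes = 4 * 1024 * 1024;  // ~4MB per column
  constexpr int64_t kValueWidth = 8;                 // assumes int64/double columns
  constexpr int64_t kRowsPerBatch = kTargetBytes / kValueWidth;  // 524288 rows

  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeFileWriter(sink, table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table, kRowsPerBatch));
  return writer->Close();
}
```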

@westonpace westonpace force-pushed the feature/ARROW-14577--simplify-ipc-reader-add-cache branch from 4aef9b3 to f3c0282 Compare January 7, 2022 20:52
@westonpace (Member Author):

> Out of curiosity, how does this affect S3 performance? (Or simulated S3 via Minio/toxiproxy which would probably be less variable.)

Generally this seems to be an improvement for S3 ("S3 Local" means the code was run in AWS, so the access was local; the EC2 instance's bandwidth appears to have been a significant limiting factor):

[charts: s3l_new, s3l_old, s3_new, s3_old]

@westonpace (Member Author):

> I wonder, depending on selectivity and batch size, how well this stacks up to just pre-caching the entire record batch (header + body) especially on S3, disregarding the column filter. Having to load the record batch metadata interspersed across the entire file is a disadvantage compared to Parquet, which stores all that in the footer (so we can coalesce reads of data across row groups) whereas here we're only coalescing within each record batch.

@lidavidm

Fair question. I've been wondering if we might want to someday investigate a variant of the Arrow IPC file format that stores batch lengths in the footer. The schema itself is there so if we had the batch lengths (which should only be 8 bytes * num_batches) then we could:

A. Have O(1) random access to an individual row
B. Know from just reading the footer all the ranges that we need to access

That being said, this seems less important when files get large enough. Even on S3 the metadata fetch is only a very small fraction of the total access time. I don't plan on investigating that as part of this PR.
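(Purely to illustrate the footer idea above, which does not exist in the format today: with per-batch row counts in hand, locating the batch that holds a given row is just a prefix sum plus a binary search. All names here are hypothetical.)

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: batch_lengths would come from the imagined footer
// field (8 bytes per batch). Returns which batch holds `row` and the offset
// of that row within the batch.
struct RowLocation {
  int64_t batch_index;
  int64_t row_in_batch;
};

RowLocation LocateRow(const std::vector<int64_t>& batch_lengths, int64_t row) {
  // Cumulative end-row of each batch, e.g. {10, 20, 15} -> {10, 30, 45}.
  std::vector<int64_t> ends(batch_lengths.size());
  std::partial_sum(batch_lengths.begin(), batch_lengths.end(), ends.begin());
  auto it = std::upper_bound(ends.begin(), ends.end(), row);
  int64_t batch = it - ends.begin();
  int64_t start = (batch == 0) ? 0 : ends[batch - 1];
  return {batch, row - start};
}
```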

@westonpace westonpace force-pushed the feature/ARROW-14577--simplify-ipc-reader-add-cache branch from f3c0282 to bf49c89 Compare January 14, 2022 02:25
@westonpace (Member Author):

I split the benchmarks into a separate PR (#12150). I did a bit more analysis today. There is a substantial performance loss in a few situations:

Async full file reads (e.g. when reading all columns): David's suggestion will probably work here but it's going to be a bit tricky to implement. Right now, each time we load a shared record batch we use a dedicated read cache, so there is no single read cache on which to cache the whole file. I plan to look at this more tomorrow.

Reading from a buffer (e.g. when doing no I/O at all). For example:

Old:

ReadBufferAsync/num_cols:1/is_partial:0/iterations:50000/real_time_mean         3623 ns         3623 ns           10 bytes_per_second=269.542G/s

New:

ReadBufferAsync/num_cols:1/is_partial:0/iterations:50000/real_time_mean         8651 ns         8651 ns           10 bytes_per_second=112.89G/s

In this particular case we are over 2x slower. Some of this slowdown is because the read range cache is calling "file->WillNeed" on these regions, which triggers a madvise call (which seems to eat up time mainly by virtue of being a system call). Removing that call gets us to 159G/s, although I'm not really sure that's the right path to take.
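(For reference, on a memory-mapped file a `WillNeed` hint roughly boils down to one `madvise(MADV_WILLNEED)` call per range, so the cost is dominated by the system call itself. A POSIX sketch, not Arrow's actual implementation:)

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

// Hint the kernel that a mapped range will be read soon. madvise requires a
// page-aligned start address, so round the offset down to a page boundary.
// Each hint is a system call, which is where the overhead shows up when the
// data is already resident in memory.
int HintWillNeed(void* mapped_base, int64_t offset, int64_t length) {
  const int64_t page = sysconf(_SC_PAGESIZE);
  const int64_t aligned = (offset / page) * page;
  char* addr = static_cast<char*>(mapped_base) + aligned;
  return madvise(addr, static_cast<size_t>(length + (offset - aligned)),
                 MADV_WILLNEED);
}
```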

I'm pretty sure the rest of the time is lost because we are using more futures, which means more allocations and shared_ptr overhead. There is no quick fix for that, but I am thinking I want to tackle Future improvements in 8.0.0.

There's a Windows build error I will fix.

At the moment I am leaning towards including this, but I'm somewhat split. The slowdowns are on an already lightning-fast path (e.g. we are going from 4000ns to 8000ns for a zero-copy buffer read) for an operation we aren't yet calling in any real critical section (these calls are per-batch).

The speedup is on a very slow path (e.g. going from 7.8 seconds to 1.7 seconds on a 1G file read because we're reading 8 columns instead of 64) but maybe not as common of one for IPC.

…ll metadata and then use that metadata to perform partial reads when reading partial columns.
@westonpace westonpace force-pushed the feature/ARROW-14577--simplify-ipc-reader-add-cache branch from bf49c89 to 874a34b Compare January 14, 2022 03:51
@lidavidm (Member):

Right, I'm not sure the slowdown on an in-memory buffer is too significant (cc @pitrou for opinions too), though it seems we should find some way for the cache to avoid madvise/WillNeed in this case (since it's purely wasted work, right?).

But it seems reading all columns would be a common operation and it would be nice to avoid the slowdown there.

… the entire file. This allows us to prebuffer the entire file which is more efficient
@lidavidm (Member):

Ah, so we fall back to the current implementation if the file is zero-copy (i.e. a buffer) or we're reading all columns? That sounds reasonable.

@westonpace (Member Author):

> Ah, so we fall back to the current implementation if the file is zero-copy (i.e. a buffer) or we're reading all columns? That sounds reasonable.

Yes, that seems to remove the major slowdowns. There is still about a 15-20% slowdown on some of the async pure-buffer operations because we use a few more Futures, but this seems acceptable to me now.
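(A rough sketch of the fallback decision described here, for illustration only; the names do not match the reader's actual internals.)

```cpp
#include <vector>

#include <arrow/io/interfaces.h>

// Illustrative only: use fine-grained, per-column reads unless the source is
// zero-copy (e.g. an in-memory buffer) or the read covers every column, in
// which case prebuffering the whole file (the old behaviour) is used instead.
bool ShouldUseFineGrainedReads(const arrow::io::RandomAccessFile& file,
                               const std::vector<int>& included_fields,
                               int num_fields) {
  const bool reads_all_columns =
      included_fields.empty() ||
      static_cast<int>(included_fields.size()) == num_fields;
  return !file.supports_zero_copy() && !reads_all_columns;
}
```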

> It'd be good to revisit this in the context of also adding column/batch statistics for filter pushdown as well. Though at that point we'd basically be Parquet/ORC without any of the fancy data encodings (which might be exactly what we're after).

Yes, sounds about right.

@westonpace westonpace requested a review from lidavidm January 14, 2022 21:09
@lidavidm (Member) left a review:

LGTM, though we should address the FIXME if applicable.

Inline review comments (resolved): cpp/src/arrow/ipc/reader.cc
@westonpace (Member Author):

I'll merge on green. Thanks @lidavidm for all the reviews and help on a Friday!

@ursabot commented Jan 15, 2022:

Benchmark runs are scheduled for baseline = f585a47 and contender = 7029f90. 7029f90 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.9% ⬆️0.9%] ursa-i9-9960x
[Finished ⬇️0.17% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
