
ARROW-14577: [C++] Enable fine grained IO for async IPC reader #11616

Conversation

@westonpace (Member) commented Nov 5, 2021:

In the end this PR only addresses projection pushdown on the asynchronous path. The approach laid out here could also be used on the synchronous path to offer pre-buffering; tests should be done to see whether that is faster.

The memory-mapped / will-need issue should be solved by making auto-will-need a property of the memory-mapped file; it will not be addressed here.
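(For illustration, not part of the PR text: a minimal sketch of the column projection this feature targets, using the existing `arrow::ipc::IpcReadOptions::included_fields` option. The file path and field indices are placeholders; with fine-grained IO the reader can limit its reads to the buffers of the selected columns rather than fetching each whole record batch.)

```cpp
#include <arrow/io/file.h>
#include <arrow/ipc/options.h>
#include <arrow/ipc/reader.h>
#include <arrow/result.h>
#include <arrow/status.h>

arrow::Status ReadSelectedColumns() {
  // Path is a placeholder for any seekable IPC file.
  ARROW_ASSIGN_OR_RAISE(auto file,
                        arrow::io::ReadableFile::Open("/tmp/example.arrow"));

  // Project only top-level fields 0 and 3; with fine-grained IO the reader
  // only needs to fetch the buffers backing these columns.
  auto options = arrow::ipc::IpcReadOptions::Defaults();
  options.included_fields = {0, 3};

  ARROW_ASSIGN_OR_RAISE(
      auto reader, arrow::ipc::RecordBatchFileReader::Open(file, options));
  for (int i = 0; i < reader->num_record_batches(); ++i) {
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
    // ... consume batch ...
  }
  return arrow::Status::OK();
}
```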

@github-actions bot commented Nov 5, 2021:

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@westonpace westonpace force-pushed the feature/ARROW-14577--simplify-ipc-reader-add-cache branch from 834957d to d699877 Compare November 18, 2021 11:59
@westonpace westonpace force-pushed the feature/ARROW-14577--simplify-ipc-reader-add-cache branch 2 times, most recently from 619532e to a516bfc Compare December 3, 2021 23:17
@westonpace westonpace marked this pull request as ready for review December 4, 2021 05:17
@westonpace (Member Author):

This is ready for review, but at the moment the performance is worse in the micro-benchmark. The micro-benchmark deals with a 1MB in-memory file, so my guess is that the extra read calls required for partial reads are what hurts. I will try it again with larger files, with memory contention, and with disk I/O to see if that changes anything.

@westonpace (Member Author):

I fixed up the benchmarks somewhat and added a benchmark for cold I/O. Initial results are a bit noisy (columns: relative time change, relative CPU change, baseline time, contender time, baseline CPU, contender CPU):

ReadRealFile/1/0/real_time_mean                          +0.1096         +0.0203     882364277     979101124     141562817     144432381
ReadRealFile/4/0/real_time_mean                          +0.1379         -0.0096    1014012277    1153867283     141160362     139811386
ReadRealFile/16/0/real_time_mean                         -0.0303         +0.0033     900165884     872914568     143141575     143615607
ReadRealFile/64/0/real_time_mean                         -0.1049         +0.0007    1154140298    1033116675     140212506     140312232
ReadRealFile/256/0/real_time_mean                        +0.0990         +0.0239    1036406449    1138971720     149655837     153234293
ReadRealFile/1024/0/real_time_mean                       +0.0572         +0.0509     841348002     889506548     160709304     168895957
ReadRealFile/4096/0/real_time_mean                       -0.0244         +0.0147    1040310983    1014902847     269116155     273081737
ReadRealFile/8192/0/real_time_mean                       +0.1152         +0.0127    1073965563    1197736789     388846023     393793665
ReadRealFile/1/1/real_time_mean                          -0.0211         -0.0686     852270837     834276010     146015520     136002655
ReadRealFile/4/1/real_time_mean                          -0.0477         -0.0025     444691103     423473671      53215988      53082698
ReadRealFile/16/1/real_time_mean                         +1.5411         -0.1796     172459194     438231423      29137200      23905451
ReadRealFile/64/1/real_time_mean                         -0.1627         -0.0478    1004655612     841218794      35870849      34156124
ReadRealFile/256/1/real_time_mean                        +0.0982         -0.0181     897612093     985737042      63828576      62673771
ReadRealFile/1024/1/real_time_mean                       +0.2167         +0.5705    1020315892    1241371726      84622012     132899024
ReadRealFile/4096/1/real_time_mean                       -0.2797         -0.1929    1209181656     871000618      71425702      57648840
ReadRealFile/8192/1/real_time_mean                       +0.2138         +0.3822     938647868    1139351095     159262940     220134001
ReadRealFileAsync/1/0/real_time_mean                     +0.3462         +0.3408     906170613    1219925626       2299242       3082930
ReadRealFileAsync/4/0/real_time_mean                     -0.1024         +0.3570    1124871244    1009668198       2312991       3138757
ReadRealFileAsync/16/0/real_time_mean                    +0.2950         +0.5326     846347457    1096059130       2476560       3795647
ReadRealFileAsync/64/0/real_time_mean                    +0.0606         +0.8470    1225185369    1299444655       2964952       5476138
ReadRealFileAsync/256/0/real_time_mean                   -0.0583         +1.6426    1091256184    1027624433       4635198      12249101
ReadRealFileAsync/1024/0/real_time_mean                  +0.2313         +2.2223     925963093    1140141694      11659770      37571048
ReadRealFileAsync/4096/0/real_time_mean                  +0.2538         +1.0950     981474329    1230572649      46248110      96890904
ReadRealFileAsync/8192/0/real_time_mean                  -0.2438         +1.0463    1352917947    1023135386      86818642     177659883
ReadRealFileAsync/1/1/real_time_mean                     +0.2791         +0.7906     856795160    1095885713       2013025       3604501
ReadRealFileAsync/4/1/real_time_mean                     -0.6628         +7.4180    1022300282     344763011       2321308      19540718
ReadRealFileAsync/16/1/real_time_mean                    -0.8603         +5.4649    1240246020     173301490       2479652      16030715
ReadRealFileAsync/64/1/real_time_mean                    -0.1437         +6.2582     852187403     729739446       2156059      15649018
ReadRealFileAsync/256/1/real_time_mean                   +0.1711         +6.8878    1059134316    1240323095       2993556      23612644
ReadRealFileAsync/1024/1/real_time_mean                  +0.1014         +8.6551    1091676930    1202395405       4424422      42718098
ReadRealFileAsync/4096/1/real_time_mean                  +0.5519         +6.7698    1042589081    1618010814       8712906      67697324
ReadRealFileAsync/8192/1/real_time_mean                  -0.2005         +4.6431    1307820760    1045555603      17304062      97648656

There is a substantial improvement in the partial-column async read case (ReadRealFileAsync/num_cols/1) at 4/16 columns. I'm not sure why that falls away as the number of columns increases, so I still need to investigate that.

@westonpace (Member Author):

(fixed charts). There is some benefit at the macro-level. These are hot I/O reads which explains why we are able to see some parallelism:

[charts: async, old, sync]

@westonpace (Member Author):

Similar results with cold I/O:

[charts: async-cold, old-cold, sync-cold]

@lidavidm (Member) commented Dec 8, 2021:

Out of curiosity, how does this affect S3 performance? (Or simulated S3 via Minio/toxiproxy which would probably be less variable.)

@lidavidm (Member) left a review:

Just some thoughts.

I wonder, depending on selectivity and batch size, how well this stacks up to just pre-caching the entire record batch (header + body) especially on S3, disregarding the column filter. Having to load the record batch metadata interspersed across the entire file is a disadvantage compared to Parquet, which stores all that in the footer (so we can coalesce reads of data across row groups) whereas here we're only coalescing within each record batch.

Inline review comments (resolved): cpp/src/arrow/ipc/read_write_benchmark.cc, cpp/src/arrow/ipc/read_write_test.cc, cpp/src/arrow/ipc/reader.cc
@westonpace (Member Author) commented Dec 16, 2021:

I dug into the performance a bit more for the small files case (I'll do S3 soon but I think I want to do real S3 and not minio since the former supports parallelism and the latter, attached to my HDD, does not).

Note: Asynchronous readers in these tests are not being consumed in parallel, so we wait until a batch is returned before reading the next batch. However, asynchronous readers still issue parallel reads and use threads; reading a single batch that needs 8 columns will trigger 8 parallel reads.

Note: Even the synchronous reader will use parallel reads if only a subset of the columns is targeted. It uses the IoRecordedRandomAccessFile, which in turn uses the read range cache, which performs reads in parallel.
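(For context, a rough sketch of the read-range-cache pattern referred to above, using `arrow::io::ReadRangeCache` from `arrow/io/caching.h`. The offsets and lengths are placeholders standing in for column buffer ranges; this is not the reader's actual call site.)

```cpp
#include <memory>
#include <vector>

#include <arrow/buffer.h>
#include <arrow/io/caching.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Hint the ranges we will need (e.g. the buffers of the selected columns),
// let the cache coalesce nearby ranges into fewer, larger reads issued in
// parallel, then serve individual reads out of the cached buffers.
arrow::Status CoalescedRead(std::shared_ptr<arrow::io::RandomAccessFile> file) {
  arrow::io::ReadRangeCache cache(file, arrow::io::default_io_context(),
                                  arrow::io::CacheOptions::Defaults());

  // Placeholder ranges standing in for the selected columns' buffers.
  std::vector<arrow::io::ReadRange> ranges = {{0, 4096}, {8192, 4096}};
  ARROW_RETURN_NOT_OK(cache.Cache(ranges));

  // Each Read() is satisfied from the coalesced, already-fetched buffers.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> buf, cache.Read(ranges[0]));
  (void)buf;
  return arrow::Status::OK();
}
```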

Hot In-Memory Memory Mapped (also, arrow::io::BufferReader)

Asynchronous reads should never be used in this case. A "read" is just pointer arithmetic. There are no copies. I didn't benchmark this case.

Cold On-Disk Memory Mapped

I did not test this. I'm not sure if it is an interesting case or not.

Hot In-Memory Regular File

Cherry picking some interesting cases (note the rate here is based on the total buffer size of the selected columns. So selecting fewer columns shouldn't yield a higher rate).

| Sync/Async | # of columns | # of columns selected | Rate (Bps) | Note |
|---|---|---|---|---|
| Sync | 16 | 16 | 9.79967G/s | Seems a limit on my machine for 1-thread DRAM bandwidth |
| Sync | 16 | 2 | 12.8979G/s | Parallel reads increase DRAM bandwidth |
| Sync | 256 | 256 | 8.73684G/s | Starting to hit CPU bottleneck from excess metadata |
| Sync | 256 | 32 | 7.28792G/s | Since we are throttled on metadata / CPU, perf gets worse |
| Async | 16 | 16 | 2.58248G/s | Async is quite a bit worse than baseline for full reads |
| Async | 16 | 2 | 13.9343G/s | Async perf is similar on partial reads |
| Async | 256 | 256 | 2.4068G/s | |
| Async | 256 | 32 | 6.8774G/s | |
| Old-Async | 16 | 16 | 2.84301G/s | Old implementation has slightly lower overhead I think |
| Old-Async | 16 | 2 | 556.501M/s | Old implementation does not handle partial reads well |
| Old-Async | 256 | 256 | 2.78802G/s | |
| Old-Async | 256 | 32 | 459.484M/s | |

Conclusions: This change significantly improves performance of partial async reads, to the point where partial async reads on "well-formed" files (data >> metadata) are comparable to sync partial reads.

Async full read is still considerably worse than sync full read, which is surprising but possibly due to threading overhead. This is worth investigating in a future PR.

Cold On-Disk Regular File

| Sync/Async | # of columns | # of columns selected | Rate (Bps) | Note |
|---|---|---|---|---|
| Sync | 16 | 16 | 111.044M/s | Baseline, HDD throughput |
| Sync | 16 | 2 | 25.205M/s | Surprising, more below |
| Sync | 256 | 256 | 99.8336M/s | |
| Sync | 256 | 32 | 15.2597M/s | Surprising |
| Async | 16 | 16 | 98.5425M/s | Similar to sync, within noise but did consistently seem a bit lower |
| Async | 16 | 2 | 54.1136M/s | |
| Async | 256 | 256 | 96.5957M/s | |
| Async | 256 | 32 | 11.911M/s | Within noise of sync result actually, seems to bottom out around a noisy 10-16 |
| Old-Async | 16 | 16 | 138.266M/s | Not just noise, old async real-file is consistently better than sync |
| Old-Async | 16 | 2 | 17.4384M/s | |
| Old-Async | 256 | 256 | 123.721M/s | |
| Old-Async | 256 | 32 | 16.4605M/s | |

Conclusions: This change does improve performance of partial async reads. However, it seems to come at the cost of full async reads. David's suggestion of falling back to a full-file read should alleviate this.

In all cases the performance of partial reads deteriorates quickly. This is because we are essentially falling back to either "reading too much" (Old-Async) or random reads. The random read rate lines up with using fio to benchmark my disk. At 16 batches the data blocks are 520KB, and fio reports ~45MBps for random reads at that block size. At 256 batches the data blocks are 32KB, and fio reports ~4MBps (either fio is too pessimistic or we are able to take advantage of the pseudo-sequential nature of the reads).

Remaining tasks

  • Add fallback to full-file read for async
  • Investigate S3
  • Investigate multi-threaded local reads (both multiple files and consuming in parallel)
  • Recommend that users structure record batches so that each column contains at least 4MB of data if they plan to read from disk (a sizing sketch follows below).
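(Purely illustrative of the last point above: one way a writer could target roughly 4 MB per fixed-width column by picking the rows per record batch. `MakeFileWriter` and `WriteTable(table, max_chunksize)` are existing Arrow APIs; the 8-byte value width is an assumption about the schema.)

```cpp
#include <memory>

#include <arrow/io/interfaces.h>
#include <arrow/ipc/writer.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>

// Pick the rows per record batch so that each fixed-width column holds
// roughly kTargetBytes of data, per the recommendation above.
arrow::Status WriteWithLargeBatches(const std::shared_ptr<arrow::Table>& table,
                                    std::shared_ptr<arrow::io::OutputStream> sink) {
  constexpr int64_t kTargetBytes = 4 * 1024 * 1024;  // ~4MB per column
  constexpr int64_t kValueWidth = 8;                 // assumes int64/double columns
  constexpr int64_t kRowsPerBatch = kTargetBytes / kValueWidth;  // 524288 rows

  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeFileWriter(sink, table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table, kRowsPerBatch));
  return writer->Close();
}
```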

@westonpace westonpace force-pushed the feature/ARROW-14577--simplify-ipc-reader-add-cache branch from 4aef9b3 to f3c0282 Compare January 7, 2022 20:52
@westonpace (Member Author):

> Out of curiosity, how does this affect S3 performance? (Or simulated S3 via Minio/toxiproxy which would probably be less variable.)

Generally this seems to be an improvement for S3 ("S3 Local" means the code was run in AWS, so the access was local; the EC2 instance's bandwidth appears to have been a significant limiting factor):

[charts: s3l_new, s3l_old, s3_new, s3_old]

@westonpace (Member Author):

> I wonder, depending on selectivity and batch size, how well this stacks up to just pre-caching the entire record batch (header + body) especially on S3, disregarding the column filter. Having to load the record batch metadata interspersed across the entire file is a disadvantage compared to Parquet, which stores all that in the footer (so we can coalesce reads of data across row groups) whereas here we're only coalescing within each record batch.

@lidavidm

Fair question. I've been wondering if we might want to someday investigate a variant of the Arrow IPC file format that stores batch lengths in the footer. The schema itself is there so if we had the batch lengths (which should only be 8 bytes * num_batches) then we could:

A. Have O(1) random access to an individual row
B. Know from just reading the footer all the ranges that we need to access

That being said, this seems less important when files get large enough. Even on S3 the metadata fetch is only a very small fraction of the total access time. I don't plan on investigating that as part of this PR.
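(Purely to illustrate the footer idea above, which does not exist in the format today: with per-batch row counts in hand, locating the batch that holds a given row is just a prefix sum plus a binary search. All names here are hypothetical.)

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: batch_lengths would come from the imagined footer
// field (8 bytes per batch). Returns which batch holds `row` and the offset
// of that row within the batch.
struct RowLocation {
  int64_t batch_index;
  int64_t row_in_batch;
};

RowLocation LocateRow(const std::vector<int64_t>& batch_lengths, int64_t row) {
  // Cumulative end-row of each batch, e.g. {10, 20, 15} -> {10, 30, 45}.
  std::vector<int64_t> ends(batch_lengths.size());
  std::partial_sum(batch_lengths.begin(), batch_lengths.end(), ends.begin());
  auto it = std::upper_bound(ends.begin(), ends.end(), row);
  int64_t batch = it - ends.begin();
  int64_t start = (batch == 0) ? 0 : ends[batch - 1];
  return {batch, row - start};
}
```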

@westonpace westonpace force-pushed the feature/ARROW-14577--simplify-ipc-reader-add-cache branch from f3c0282 to bf49c89 Compare January 14, 2022 02:25
@westonpace (Member Author):

I split the benchmarks into a separate PR (#12150). I did a bit more analysis today. There is a substantial performance loss in a few situations:

Async full file reads (e.g. when reading all columns): David's suggestion will probably work here but it's going to be a bit tricky to implement. Right now, each time we load a shared record batch we use a dedicated read cache, so there is no single read cache on which to cache the whole file. I plan to look at this more tomorrow.

Reading from a buffer (e.g. when doing no I/O at all). For example:

Old:

ReadBufferAsync/num_cols:1/is_partial:0/iterations:50000/real_time_mean         3623 ns         3623 ns           10 bytes_per_second=269.542G/s

New:

ReadBufferAsync/num_cols:1/is_partial:0/iterations:50000/real_time_mean         8651 ns         8651 ns           10 bytes_per_second=112.89G/s

In this particular case we are over 2x slower. Some of this slowdown is because the read range cache is calling "file->WillNeed" on these regions, which triggers a madvise call (which seems to eat up time mainly by virtue of being a system call). Removing that call gets us to 159G/s, although I'm not really sure that's the right path to take.
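(For reference, on a memory-mapped file a `WillNeed` hint roughly boils down to one `madvise(MADV_WILLNEED)` call per range, so the cost is dominated by the system call itself. A POSIX sketch, not Arrow's actual implementation:)

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

// Hint the kernel that a mapped range will be read soon. madvise requires a
// page-aligned start address, so round the offset down to a page boundary.
// Each hint is a system call, which is where the overhead shows up when the
// data is already resident in memory.
int HintWillNeed(void* mapped_base, int64_t offset, int64_t length) {
  const int64_t page = sysconf(_SC_PAGESIZE);
  const int64_t aligned = (offset / page) * page;
  char* addr = static_cast<char*>(mapped_base) + aligned;
  return madvise(addr, static_cast<size_t>(length + (offset - aligned)),
                 MADV_WILLNEED);
}
```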

I'm pretty sure the rest of the time is lost because we are using more futures, which means more allocations and shared_ptr overhead. There is no quick fix for that, but I am thinking I want to tackle Future improvements in 8.0.0.

There's a Windows build error I will fix.

At the moment I am leaning towards including this, but I'm somewhat split. The slowdowns are on an already lightning-fast path (e.g. we are going from 4000ns to 8000ns for a zero-copy buffer read) for an operation we aren't yet calling in any real critical section (these calls are per-batch).

The speedup is on a very slow path (e.g. going from 7.8 seconds to 1.7 seconds on a 1G file read because we're reading 8 columns instead of 64) but maybe not as common of one for IPC.

…ll metadata and then use that metadata to perform partial reads when reading partial columns.
@westonpace westonpace force-pushed the feature/ARROW-14577--simplify-ipc-reader-add-cache branch from bf49c89 to 874a34b Compare January 14, 2022 03:51
@lidavidm (Member):

Right, I'm not sure the slowdown on an in-memory buffer is too significant (cc @pitrou for opinions too), though it seems we should find some way for the cache to avoid madvise/WillNeed in this case (since it's purely wasted work, right?).

But it seems reading all columns would be a common operation and it would be nice to avoid the slowdown there.

… the entire file. This allows us to prebuffer the entire file which is more efficient
@lidavidm (Member):

Ah, so we fall back to the current implementation if the file is zero-copy (i.e. a buffer) or we're reading all columns? That sounds reasonable.

@westonpace (Member Author):

> Ah, so we fall back to the current implementation if the file is zero-copy (i.e. a buffer) or we're reading all columns? That sounds reasonable.

Yes, that seems to remove the major slowdowns. There is still about a 15-20% slowdown on some of the async pure-buffer operations because we use a few more Futures, but this seems acceptable to me now.
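(A rough sketch of the fallback decision described here, for illustration only; the names do not match the reader's actual internals.)

```cpp
#include <vector>

#include <arrow/io/interfaces.h>

// Illustrative only: use fine-grained, per-column reads unless the source is
// zero-copy (e.g. an in-memory buffer) or the read covers every column, in
// which case prebuffering the whole file (the old behaviour) is used instead.
bool ShouldUseFineGrainedReads(const arrow::io::RandomAccessFile& file,
                               const std::vector<int>& included_fields,
                               int num_fields) {
  const bool reads_all_columns =
      included_fields.empty() ||
      static_cast<int>(included_fields.size()) == num_fields;
  return !file.supports_zero_copy() && !reads_all_columns;
}
```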

> It'd be good to revisit this in the context of also adding column/batch statistics for filter pushdown as well. Though at that point we'd basically be Parquet/ORC without any of the fancy data encodings (which might be exactly what we're after).

Yes, sounds about right.

@westonpace westonpace requested a review from lidavidm January 14, 2022 21:09
@lidavidm (Member) left a review:

LGTM, though we should address the FIXME if applicable.

Inline review comments (resolved): cpp/src/arrow/ipc/reader.cc
@westonpace (Member Author):

I'll merge on green. Thanks @lidavidm for all the reviews and help on a Friday!

@ursabot commented Jan 15, 2022:

Benchmark runs are scheduled for baseline = f585a47 and contender = 7029f90. 7029f90 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.9% ⬆️0.9%] ursa-i9-9960x
[Finished ⬇️0.17% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
