[draft] accounts: parallel load across txs via core-pinned threadpool #17774
Conversation
Force-pushed from 9390de1 to 20b9f11.
Force-pushed from 20b9f11 to 8568d5d.
```rust
            }
            let id_for_spawn = self.cores[self.core_id_pointer];
            b.spawn(move || {
                // Pin this worker thread to one of the selected logical cores.
                core_affinity::set_for_current(id_for_spawn);
```
Could you expand in a code comment on what problem we're trying to solve by pinning these threads, and what the benefit of pinning them to particular cores is (cache locality?). Perhaps also add a new bench in runtime/benches/accounts.rs to demonstrate the difference?

Check that there's no perf regression when running bench-tps.

It might also be a good idea to try running this on a full validator to see if there are any unexpected drawdowns in other components of the validator. For instance, I think the PoH thread is pinned to DEFAULT_PINNED_CPU_CORE: usize = 0;
Hey @carllin,

My apologies that the thoughts are a little jumbled and still evolving on this issue; I will eventually summarise the investigations in a report of sorts. There is a larger question of avoiding mmap, as most modern SSD-oriented DBs do, and the potential benefits ought at least to be spelled out.

The biggest benefit may be reducing memory requirements while improving the worst-case critical path (this is related to some security, performance, and pricing issues, such as the tx cost model).

This matters when considering a persistent index. I think the persistent index should be written using io_uring, falling back to polling disk-read syscalls handled via async/await.
Pinning the threads in particular was just an idea to reduce resource contention, so that the loading threads don't eat up the resources of the entire CPU. However, by measuring the time load_accounts_parallel takes when running bench-tps, I've found that pinning more than 2 threads per core makes things worse. Pinning 2 threads to every logical core, rather than to a subset of the cores, is actually the best option for the thread pool (250-500us for a 128-tx batch). One can avoid core 0 for PoH.
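For reference, a minimal sketch of how such a pinned pool could be built with the rayon and core_affinity crates (the 2-threads-per-core factor and the skipping of core 0 are taken from the measurements above; this is an illustration, not the exact code in this PR):

```rust
use rayon::ThreadPoolBuilder;

/// Build a thread pool with `threads_per_core` workers pinned to each logical
/// core except core 0 (left free for the PoH thread).
fn build_pinned_pool(threads_per_core: usize) -> rayon::ThreadPool {
    // Logical cores reported by the OS, minus core 0.
    let cores: Vec<core_affinity::CoreId> = core_affinity::get_core_ids()
        .expect("failed to query core ids")
        .into_iter()
        .filter(|c| c.id != 0)
        .collect();

    let cores_for_handler = cores.clone();
    ThreadPoolBuilder::new()
        .num_threads(cores.len() * threads_per_core)
        .start_handler(move |thread_index| {
            // Round-robin the workers over the available cores, so each core
            // ends up with `threads_per_core` pinned workers.
            let core = cores_for_handler[thread_index % cores_for_handler.len()];
            core_affinity::set_for_current(core);
        })
        .build()
        .expect("failed to build pinned thread pool")
}
```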
Note that a different scenario arises in the replay stage: one can no longer "cheat" by parallelising across threads, because the default behaviour already saturates the threads with load_and_execute. In effect, one pays the overhead of submitting to rayon but gets none of the benefits.
The fundamental problem with spinning up threads is that one doesn't know in advance when a page fault will be triggered. Spinning up a thread when the mmap'd page is already resident is wasteful, but not spinning one up when there is a fault increases latency. Delegating the decision to a scheduler like rayon is imperfect at best: as the results below show, there isn't one configuration that works in all cases, and the worst-case critical path is also not fully mitigated.

Ideally, the application layer would know when data is cached and when it has to hit disk, and could thus spin up a thread as needed, or use an async runtime that doesn't even require spawning more OS threads, rather than relying on the OS as a black box that can arbitrarily stall a thread's execution. In other words, having the I/O operation occur at the application/kernel boundary is much better than having it occur entirely in the kernel layer.
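To make the "application layer knows when data is cached" idea concrete, here is a minimal, Linux-only sketch (an illustration, not code from this PR) that uses mincore to check page residency before deciding whether to offload a load to the I/O pool:

```rust
use std::io;

/// Returns true if every page of the mapped region `[addr, addr + len)` is
/// resident in memory, i.e. reading it should not page-fault to disk.
/// Linux-only; `addr` must be page-aligned and part of a valid mapping.
fn is_resident(addr: *const u8, len: usize) -> io::Result<bool> {
    let page_size = unsafe { libc::sysconf(libc::_SC_PAGESIZE) } as usize;
    let pages = (len + page_size - 1) / page_size;
    let mut residency = vec![0u8; pages];
    let rc = unsafe {
        libc::mincore(addr as *mut libc::c_void, len, residency.as_mut_ptr())
    };
    if rc != 0 {
        return Err(io::Error::last_os_error());
    }
    // Bit 0 of each byte is set if the corresponding page is resident.
    Ok(residency.iter().all(|b| b & 1 == 1))
}
```

In the hit case the caller could load inline; only on a miss would it pay the cost of handing the work to another thread or an async read.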
Results, fully in RAM (approximate, from eyeballing):

| Description | 128-tx load_accounts_parallel time (banking stage) | 128-tx load time (replay stage) |
|---|---|---|
| Pin 2 threads to all logical cores | 250-500us | avg 12_000us, tail 25_000us |
| Pin 1 thread to all logical cores | 250-500us | avg 25_000us, tail 50_000us |
| Pin 4 threads to all logical cores | 800-1800us | avg 25_000us, tail 50_000us |
| Pin 1 thread to 1/4 of logical cores | 800-1800us | NIL |
| Ordinary thread pool, 1 thread per logical core | 800-1500us | avg 25_000us, tail 50_000us |
| Ordinary thread pool, 2 threads per logical core | 800-1500us | avg 12_000us, tail 25_000us |
| Single thread | 1800-5000us | 1800-5000us |
As you can see, while parallelism helps the banking stage, utilising the same threadpool leads to contention for the replay stage and results in worse performance.
As it is difficult to manually trigger page faults, I decided to simulate a single 250us page fault on every account load.
```rust
pub fn get_account<'a>(&'a self, offset: usize) -> Option<(StoredAccountMeta<'a>, usize)> {
    // Simulate a single 250us "page fault" on every account load.
    std::thread::sleep(std::time::Duration::from_micros(250));
```
Here are my results (32 I/O threads on an 8C16T machine):

Banking: this PR 10_000-25_000us; master 350_000-500_000us.
This first result is very reasonably explained by the following fact (and by latency spikes):
2 accounts/tx * 128 tx * 250us / account / 8 threads / bank = 8000 us.
The second result is not really explained. There's a factor of 5 discrepancy.
2 accounts/tx * 128 tx * 250us / account = 64_000 us.
It may be that the threads are deprioritised because they never do useful work (they are always sleeping in the simulated fault)...
Replay: this PR 25-50_000us; master 350_000us.
Is this acceptable? Not really. We can certainly do better.
```diff
@@ -467,6 +469,87 @@ impl Accounts {
             .collect()
     }
 
+    pub fn load_accounts_parallel<'a>(
```
Hmmm, this is not a bad idea, but I'm wondering if we can get even better perf if we scheduled the entire load + execute pathway in parallel. Might be worth exploring 😃
Hmm, the reason why I think that's not necessarily a good idea is that the execute portion is CPU-bound.
The purpose of making the load portion parallel or async was to reduce the tail-end critical path of sequentially loading 128 TXs with up to 20-256 accounts each.
As I've mentioned elsewhere, that's a maximum critical path of 640ms-7+s on a storage device with 250us read latency. This is extremely bad for the costing model that @taozhu-chicago is working on and for security.
In general tx execution parallelism can already be handled by spawning more banking or replay threads.
I'm not familiar with how predictable the execution critical path is, but I do know that the compute unit limit is about 200K now; if that corresponds to CPU cycles, at roughly 60us per tx that's a critical path of about 7.68ms for a 128-tx batch.

Ideally, the DB would not use mmap, so that one could use something like io_uring, which can achieve ~1M direct-I/O (i.e. zero-copy) IOPS on a single thread. Then one does not have to manage thread scheduling, contention, and switching, and can go async at the granularity of individual loads. The loading APIs should then be rewritten in terms of batch APIs that do async I/O under the hood.

Batching further helps io_uring by reducing syscall overhead. AccountStoredMeta deserialization could also occur in the kernel via eBPF, so that the application layer does not need to do it. Deserializing into a buffer of the right size may be a challenge there; however, AccountInfo already contains the data size, so it shouldn't be a problem.

This necessitates a better application-layer cache, since io_uring would be used in zero-copy/unbuffered mode.
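For illustration, a minimal sketch of what a batched read via the io-uring crate could look like (my sketch, not part of this PR; the file path, offsets, and sizes are placeholders, and O_DIRECT/registered buffers are omitted for brevity):

```rust
use std::{fs::OpenOptions, os::unix::io::AsRawFd};
use io_uring::{opcode, types, IoUring};

fn main() -> std::io::Result<()> {
    // One ring, many reads: a whole batch of account loads costs roughly one
    // submit syscall instead of one read syscall per account.
    let mut ring = IoUring::new(64)?;
    let file = OpenOptions::new().read(true).open("accounts.store")?; // placeholder path
    let fd = types::Fd(file.as_raw_fd());

    // Pretend these are the (offset, size) pairs resolved from the index.
    let requests: Vec<(u64, usize)> = vec![(0, 4096), (8192, 4096), (65536, 4096)];
    let mut buffers: Vec<Vec<u8>> = requests.iter().map(|&(_, len)| vec![0u8; len]).collect();

    for (i, (&(offset, len), buf)) in requests.iter().zip(buffers.iter_mut()).enumerate() {
        let read = opcode::Read::new(fd, buf.as_mut_ptr(), len as u32)
            .offset(offset as _)
            .build()
            .user_data(i as u64);
        // Safety: each buffer outlives its submitted operation.
        unsafe { ring.submission().push(&read).expect("submission queue full") };
    }

    // One syscall submits the whole batch and waits for all completions.
    ring.submit_and_wait(requests.len())?;
    for cqe in ring.completion() {
        println!("request {} completed with result {}", cqe.user_data(), cqe.result());
    }
    Ok(())
}
```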
Example API:

```rust
fn get_accounts_async_batch(&self, keys: &[Pubkey]) -> Vec<LoadedAccount> { .. }

// Downstream
fn load_accounts_batched(..) {
    let mut key_batch = Vec::new();
    let mut should_loads = Vec::new();
    for tx in batch {
        for key in tx {
            let should_load = f(key, ...);
            if should_load {
                key_batch.push(key);
            }
            should_loads.push(should_load);
        }
    }

    // Critical latency: a single ~250us batched round trip! :-)
    let loaded_accounts = self.accounts_db.get_accounts_async_batch(&key_batch[..]);

    let mut access = 0;
    let mut loaded_access = 0;
    for tx in batch {
        for key in tx {
            if should_loads[access] {
                let account = &loaded_accounts[loaded_access];
                // Pay rent, check the executable flag, etc.
                loaded_access += 1;
            }
            access += 1;
        }
    }
}
```
Notice that batching forces copying (into a LoadedTransaction), however. One can get rid of the copy by returning Vec<Vec<AccountSharedData>> instead.
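A sketch of that alternative return shape (the trait and method names here are hypothetical):

```rust
use solana_sdk::{account::AccountSharedData, pubkey::Pubkey};

/// Hypothetical batch-loading interface that hands back one Vec per
/// transaction, so accounts can be moved straight into the per-tx result
/// instead of being copied out of a flat batch.
pub trait BatchAccountLoader {
    fn get_accounts_batch_per_tx(
        &self,
        keys_per_tx: &[Vec<Pubkey>],
    ) -> Vec<Vec<AccountSharedData>>;
}
```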
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

This stale pull request has been automatically closed. Thank you for your contributions.
Problem

Currently, load_and_execute_transactions loads sequentially within the bank. This means the worst-case latency of multiple sequential page faults can stall throughput.

Summary of Changes

Parallelise loads across txs using an over-provisioned thread pool pinned to a fixed number of logical cores. The thread pool is over-provisioned (by default 8x) to cope with the OS blocking threads on page faults.
Fixes: #17765 #17761
CC (for early feedback): @sakridge @carllin
TODO:
Related: