[draft] accounts: parallel load across txs via core-pinned threadpool #17774
Conversation
Force-pushed from 9390de1 to 20b9f11.
Force-pushed from 20b9f11 to 8568d5d.
```rust
            }
            let id_for_spawn = self.cores[self.core_id_pointer];
            b.spawn(move || {
                // Pin this worker thread to one of the selected logical cores.
                core_affinity::set_for_current(id_for_spawn);
```
Could you expand in a code comment on what problem we're trying to solve by pinning these threads, and what the benefit of pinning them to particular cores is (cache locality?). Perhaps also add a new bench in runtime/benches/accounts.rs to demonstrate the difference?

Check that there's no perf regression when running bench-tps.

It might also be a good idea to try running this on a full validator to see if there are any unexpected drawdowns in other components of the validator. For instance, I think the PoH thread is pinned to DEFAULT_PINNED_CPU_CORE: usize = 0;
Hey @carllin,

My apologies that the thoughts are a little jumbled and still evolving on this issue; I will eventually summarise the investigations in a report of sorts. There is a larger question of avoiding mmap, as most modern SSD-oriented DBs do, and the potential benefits ought at least to be spelled out.

The biggest benefit may be reducing memory requirements while improving the worst-case critical path (this is related to some security, performance, and pricing issues, such as the tx cost model).

This matters when considering a persistent index. I think the persistent index should be written using io_uring, falling back to polling disk-read syscalls handled via async/await.
Pinning the threads in particular was just an idea to reduce resource contention, so that the loading threads don't eat up the resources of the entire CPU. However, by measuring the time load_accounts_parallel takes when running bench-tps, I've found that pinning more than 2 threads per core makes things worse. Pinning 2 threads to every logical core, rather than to a subset of the cores, is actually the best option for the thread pool (250-500us for a 128-tx batch). One can avoid core 0 for PoH.
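For reference, a minimal sketch of how such a pinned pool could be built with the rayon and core_affinity crates (the 2-threads-per-core factor and the skipping of core 0 are taken from the measurements above; this is an illustration, not the exact code in this PR):

```rust
use rayon::ThreadPoolBuilder;

/// Build a thread pool with `threads_per_core` workers pinned to each logical
/// core except core 0 (left free for the PoH thread).
fn build_pinned_pool(threads_per_core: usize) -> rayon::ThreadPool {
    // Logical cores reported by the OS, minus core 0.
    let cores: Vec<core_affinity::CoreId> = core_affinity::get_core_ids()
        .expect("failed to query core ids")
        .into_iter()
        .filter(|c| c.id != 0)
        .collect();

    let cores_for_handler = cores.clone();
    ThreadPoolBuilder::new()
        .num_threads(cores.len() * threads_per_core)
        .start_handler(move |thread_index| {
            // Round-robin the workers over the available cores, so each core
            // ends up with `threads_per_core` pinned workers.
            let core = cores_for_handler[thread_index % cores_for_handler.len()];
            core_affinity::set_for_current(core);
        })
        .build()
        .expect("failed to build pinned thread pool")
}
```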
Note that a different scenario arises in the replay stage: one can no longer "cheat" by parallelising across threads, because the default behaviour already saturates the threads with load_and_execute. In effect, one pays the overhead of submitting to rayon but gets none of the benefits.
The fundamental problem with spinning up threads is that one doesn't know in advance when a page fault will be triggered. Spinning up a thread when the mmap'd page is already resident is wasteful, but not spinning one up when there is a fault increases latency. Delegating the decision to a scheduler like rayon is imperfect at best: as the results below show, there isn't one configuration that works in all cases, and the worst-case critical path is also not fully mitigated.

Ideally, the application layer would know when data is cached and when it has to hit disk, and could thus spin up a thread as needed, or use an async runtime that doesn't even require spawning more OS threads, rather than relying on the OS as a black box that can arbitrarily stall a thread's execution. In other words, having the I/O operation occur at the application/kernel boundary is much better than having it occur entirely in the kernel layer.
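To make the "application layer knows when data is cached" idea concrete, here is a minimal, Linux-only sketch (an illustration, not code from this PR) that uses mincore to check page residency before deciding whether to offload a load to the I/O pool:

```rust
use std::io;

/// Returns true if every page of the mapped region `[addr, addr + len)` is
/// resident in memory, i.e. reading it should not page-fault to disk.
/// Linux-only; `addr` must be page-aligned and part of a valid mapping.
fn is_resident(addr: *const u8, len: usize) -> io::Result<bool> {
    let page_size = unsafe { libc::sysconf(libc::_SC_PAGESIZE) } as usize;
    let pages = (len + page_size - 1) / page_size;
    let mut residency = vec![0u8; pages];
    let rc = unsafe {
        libc::mincore(addr as *mut libc::c_void, len, residency.as_mut_ptr())
    };
    if rc != 0 {
        return Err(io::Error::last_os_error());
    }
    // Bit 0 of each byte is set if the corresponding page is resident.
    Ok(residency.iter().all(|b| b & 1 == 1))
}
```

In the hit case the caller could load inline; only on a miss would it pay the cost of handing the work to another thread or an async read.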
Results, fully in RAM (approximate, from eyeballing):

| Description | 128-tx load_accounts_parallel time (banking stage) | 128-tx load time (replay stage) |
|---|---|---|
| Pin 2 threads to all logical cores | 250-500us | avg 12_000us, tail 25_000us |
| Pin 1 thread to all logical cores | 250-500us | avg 25_000us, tail 50_000us |
| Pin 4 threads to all logical cores | 800-1800us | avg 25_000us, tail 50_000us |
| Pin 1 thread to 1/4 of logical cores | 800-1800us | NIL |
| Ordinary thread pool, 1 thread per logical core | 800-1500us | avg 25_000us, tail 50_000us |
| Ordinary thread pool, 2 threads per logical core | 800-1500us | avg 12_000us, tail 25_000us |
| Single thread | 1800-5000us | 1800-5000us |
As you can see, while parallelism helps the banking stage, utilising the same threadpool leads to contention for the replay stage and results in worse performance.
As it is difficult to manually trigger page faults, I decided to simulate a single 250us page fault on every account load.
```rust
pub fn get_account<'a>(&'a self, offset: usize) -> Option<(StoredAccountMeta<'a>, usize)> {
    // Simulate a single 250us "page fault" on every account load.
    std::thread::sleep(std::time::Duration::from_micros(250));
```
Here are my results (32 I/O threads on an 8C16T machine):

Banking: this PR 10_000-25_000us; master 350_000-500_000us.
This first result is very reasonably explained by the following fact (and by latency spikes):
2 accounts/tx * 128 tx * 250us / account / 8 threads / bank = 8000 us.
The second result is not really explained. There's a factor of 5 discrepancy.
2 accounts/tx * 128 tx * 250us / account = 64_000 us.
It may be that the threads are deprioritised because they never do useful work (they are always sleeping in the simulated fault)...
Replay: this PR 25-50_000us; master 350_000us.
Is this acceptable? Not really. We can certainly do better.
```diff
@@ -467,6 +469,87 @@ impl Accounts {
             .collect()
     }
 
+    pub fn load_accounts_parallel<'a>(
```
Hmmm, this is not a bad idea, but I'm wondering if we can get even better perf if we scheduled the entire load + execute pathway in parallel. Might be worth exploring 😃
Hmm, the reason why I think that's not necessarily a good idea is that the execute portion is CPU-bound.
The purpose of making the load portion parallel or async was to reduce the tail-end critical path of sequentially loading 128 TXs with up to 20-256 accounts each.
As I've mentioned elsewhere, that's a maximum critical path of 640ms-7+s on a storage device with 250us read latency. This is extremely bad for the costing model that @taozhu-chicago is working on and for security.
In general tx execution parallelism can already be handled by spawning more banking or replay threads.
I'm not familiar with how predictable the execution critical path is, but I do know that the compute unit limit is about 200K now; if that corresponds to CPU cycles, at roughly 60us per tx that's a critical path of about 7.68ms for a 128-tx batch.

Ideally, the DB would not use mmap, so that one could use something like io_uring, which can achieve ~1M direct-I/O (i.e. zero-copy) IOPS on a single thread. Then one does not have to manage thread scheduling, contention, and switching, and can go async at the granularity of individual loads. The loading APIs should then be rewritten in terms of batch APIs that do async I/O under the hood.

Batching further helps io_uring by reducing syscall overhead. AccountStoredMeta deserialization could also occur in the kernel via eBPF, so that the application layer does not need to do it. Deserializing into a buffer of the right size may be a challenge there; however, AccountInfo already contains the data size, so it shouldn't be a problem.

This necessitates a better application-layer cache, since io_uring would be used in zero-copy/unbuffered mode.
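For illustration, a minimal sketch of what a batched read via the io-uring crate could look like (my sketch, not part of this PR; the file path, offsets, and sizes are placeholders, and O_DIRECT/registered buffers are omitted for brevity):

```rust
use std::{fs::OpenOptions, os::unix::io::AsRawFd};
use io_uring::{opcode, types, IoUring};

fn main() -> std::io::Result<()> {
    // One ring, many reads: a whole batch of account loads costs roughly one
    // submit syscall instead of one read syscall per account.
    let mut ring = IoUring::new(64)?;
    let file = OpenOptions::new().read(true).open("accounts.store")?; // placeholder path
    let fd = types::Fd(file.as_raw_fd());

    // Pretend these are the (offset, size) pairs resolved from the index.
    let requests: Vec<(u64, usize)> = vec![(0, 4096), (8192, 4096), (65536, 4096)];
    let mut buffers: Vec<Vec<u8>> = requests.iter().map(|&(_, len)| vec![0u8; len]).collect();

    for (i, (&(offset, len), buf)) in requests.iter().zip(buffers.iter_mut()).enumerate() {
        let read = opcode::Read::new(fd, buf.as_mut_ptr(), len as u32)
            .offset(offset as _)
            .build()
            .user_data(i as u64);
        // Safety: each buffer outlives its submitted operation.
        unsafe { ring.submission().push(&read).expect("submission queue full") };
    }

    // One syscall submits the whole batch and waits for all completions.
    ring.submit_and_wait(requests.len())?;
    for cqe in ring.completion() {
        println!("request {} completed with result {}", cqe.user_data(), cqe.result());
    }
    Ok(())
}
```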
Example API:

```rust
fn get_accounts_async_batch(&self, keys: &[Pubkey]) -> Vec<LoadedAccount> { .. }

// Downstream
fn load_accounts_batched(..) {
    let mut key_batch = Vec::new();
    let mut should_loads = Vec::new();
    for tx in batch {
        for key in tx {
            let should_load = f(key, ...);
            if should_load {
                key_batch.push(key);
            }
            should_loads.push(should_load);
        }
    }

    // Critical latency: a single ~250us batched round trip! :-)
    let loaded_accounts = self.accounts_db.get_accounts_async_batch(&key_batch[..]);

    let mut access = 0;
    let mut loaded_access = 0;
    for tx in batch {
        for key in tx {
            if should_loads[access] {
                let account = &loaded_accounts[loaded_access];
                // Pay rent, check the executable flag, etc.
                loaded_access += 1;
            }
            access += 1;
        }
    }
}
```
Notice that batching forces copying (into a LoadedTransaction), however. One can get rid of the copy by returning Vec<Vec<AccountSharedData>> instead.
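A sketch of that alternative return shape (the trait and method names here are hypothetical):

```rust
use solana_sdk::{account::AccountSharedData, pubkey::Pubkey};

/// Hypothetical batch-loading interface that hands back one Vec per
/// transaction, so accounts can be moved straight into the per-tx result
/// instead of being copied out of a flat batch.
pub trait BatchAccountLoader {
    fn get_accounts_batch_per_tx(
        &self,
        keys_per_tx: &[Vec<Pubkey>],
    ) -> Vec<Vec<AccountSharedData>>;
}
```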
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

This stale pull request has been automatically closed. Thank you for your contributions.
Problem

Currently, load_and_execute_transactions loads sequentially within the bank. This means the worst-case latency of multiple sequential page faults can stall throughput.

Summary of Changes

Parallelise loads across txs using an over-provisioned thread pool pinned to a fixed number of logical cores. The thread pool is over-provisioned (by default 8x) to cope with the OS blocking threads on page faults.
Fixes: #17765 #17761
CC (for early feedback): @sakridge @carllin
TODO:
Related: