BPF program serialization/deserialization may no longer be necessary #14523

jackcmay · 2021-01-10T15:42:42Z

Now that programs no longer support returning errors "gracefully" (#14500), the way data flows to and through a cross-program invocation call chain also changes and may no longer require serialization or extra copies.

There is no longer any roll-back or intermediate discarding of data. Account data now flows through the call chain linearly from one program to the next, either to the end of the chain upon success or up until an error where the data is discarded. Because of this, rather than copying data in and out of each program's environment, the data can be passed (and translated) via pointers. Each program in the call chain incrementally operates on the data, and the runtime verifies and snapshots in-between and at the end.

Doing so provides the following advantages:

No serialization or deserialization of program parameters. Instead, the parameters undergo a deep pointer translation into and out of the program's virtual address space.
No extra copying is required to update the caller's state between calls since the called program operates directly on the shared data. Translating the data types into and out of the program is a fast operation.
Could support a single entrypoint between native and BPF programs
Operating on shared data potentially makes it easier to support account data reallocations
In Rust, account data would be more naturally handled and understood by the compiler. Right now we have an issue where account data is modified by the runtime on data values the program thinks is immutable (account data slice size for example). Depending on what assumptions the compiler makes some updated values may be stale in registers upon return.

But comes with the following challenges:

Rust does not have a stable ABI so any types translated should be done on stable ABI types (repr(C)). We already have this issue when passing Instruction and AccountIInfo via Invoke. New repr(C) types should be used for all data passed into and out of the programs. This means the current native entrypoint (using KeyedAccounts is not well suited.
Reallocating account data from the program space may be possible but would require coordination between the program's memory allocator and the native allocator in order to translate and route allocation requests to the native allocator.
Potentially lots of churn in the runtime and sdk, for example, if we replace program parameters (native and BPF) with something that is repr(C) we should probably make that change all the way down to accounts so that we don't incur any extra copies of the account data then are absolutely necessary.

Overall I think this proposal does a lot for both performance, stability, and clarity of the data being passed to and between programs.

The text was updated successfully, but these errors were encountered:

Lichtso · 2021-01-20T10:45:02Z

Here is the plan:

Build a custom allocator for the Account struct (probably somewhere around here) which places all accounts loaded in a TransactionBatch in a consecutive memory range.
Let each transaction create a BPF VM MemoryRegion from a slice of this accounts array into the BPF VM, so that only the accounts belonging to that transaction are inside the slice.
Replace the account data copy steps by using the MemoryRegion, but keep the serialization interface.
Then redesign the BPF program interface to get rid of the de-/serialization.

Fix the immutable problem
Allow for better realloc of account data
Unify entrypoints of BPF and native programs

Refactor the Account struct to be repr(C) everywhere

jackcmay · 2021-01-20T15:15:41Z

@Lichtso Thanks for summarizing our discussion, I have broken out some of the action items described in this issue:
#14691
#14692
#14693
#14694
#14695
#14696

ryoqun · 2021-01-23T00:38:57Z

You might know this already. But mem copies in the transaction execution path is really hurting performance. I found this by casually perf-ing for reviewing #14596 .

Currently, supposedly large account data consumes more cpu cycles just for moving/copying them. And, interpreted bpf execution only consumes ~6%

Note that _blake3_hash_many_avx512 will be Account's DB problem..

From #14596 (comment):

$ sudo perf top -t $(echo $(ps -e -T | grep blockstore_proc | awk '{print $2}') | sed 's/ /,/g') --proc-map-timeout 10000

Samples: 165K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 14739498376 lost: 0/0 drop: 0/0
Overhead  Shared Object       Symbol
  42.74%  libc-2.31.so        [.] __memmove_avx_unaligned_erms
   8.30%  solana-validator    [.] _blake3_hash_many_avx512
   4.45%  solana-validator    [.] solana_rbpf::vm::EbpfVm<E,I>::execute_program_interpreted_inner
   3.85%  libc-2.31.so        [.] __sched_yield
   2.87%  libc-2.31.so        [.] _int_malloc
   2.40%  libc-2.31.so        [.] __memset_avx2_erms
   2.40%  solana-validator    [.] solana_rbpf::memory_region::MemoryMapping::map
   2.13%  solana-validator    [.] solana_rbpf::ebpf::get_insn_unchecked
   1.88%  [kernel]            [k] __lock_text_start
   1.85%  [kernel]            [k] __sched_text_start
   1.66%  [kernel]            [k] do_syscall_64
   1.58%  solana-validator    [.] <&mut bincode::de::Deserializer<R,O> as serde::de::Deserializer>::deserialize_tuple
   1.09%  solana-validator    [.] <crossbeam_epoch::sync::list::Iter<T,C> as core::iter::traits::iterator::Iterator>::next
   1.05%  solana-validator    [.] alloc::collections::btree::search::search_node
   0.94%  [kernel]            [k] finish_task_switch
   0.86%  libc-2.31.so        [.] malloc
   0.79%  solana-validator    [.] <<&mut bincode::de::Deserializer<R,O> as serde::de::Deserializer>::deserialize_tuple::Access<R,O> as serde::de::SeqAccess>::next_element_seed
   0.71%  libc-2.31.so        [.] __memcmp_avx2_movbe
   0.63%  solana-validator    [.] _blake3_compress_in_place_avx512
   0.61%  solana-validator    [.] bs58::encode::encode_into
   0.58%  solana-validator    [.] solana_runtime::accounts_index::AccountsIndex<T>::get_account_read_entry
   0.58%  libpthread-2.31.so  [.] __pthread_rwlock_rdlock
   0.55%  solana-validator    [.] bs58::encode::encode_into
   0.54%  solana-validator    [.] <solana_rbpf::elf::EBpfElf<E,I> as solana_rbpf::vm::Executable<E,I>>::lookup_bpf_function
   0.51%  libc-2.31.so        [.] malloc_consolidate

... So, I'm really looking forward to the improvements in this area. :)

jeffwashington · 2021-04-05T15:46:22Z

I would hope this is largely addressed by the copy on write account data change that went in a few weeks ago. @ryoqun , if it is possible for you to run this again, that would be swell. or better, if you want to point me to how to run like this. Do you have to build differently? I've not yet run perf on our stack. @ryoqun

sakridge · 2021-04-05T15:59:07Z

I think he used this to connect while it is already running:
sudo perf top -t $(echo $(ps -e -T | grep blockstore_proc | awk '{print $2}') | sed 's/ /,/g') --proc-map-timeout 10000

If you want to start recording from the beginning:

sudo perf record -F99 --call-graph dwarf target/release/solana-ledger-tool

or Intel:

sudo perf record -F99 --call-graph lbr target/release/solana-ledger-tool

can make a flame graph:
https://github.com/sakridge/unix-home/blob/master/bin/make-flame-graph.sh

Has a hardcoded path of https://github.com/brendangregg/FlameGraph in ~/src/FlameGraph

ryoqun · 2021-04-15T14:31:21Z

here's updated perf top, seeing progress. (kudos to @jeffwashington ) :) This is very casual check with recent master, comparing to this: #14523 (comment)

I'm seeing vastly reduced memmove (42% => 15%)
_blake3_hash_many_avx512 disappeared
execute_program_interpreted_inner reduced and solana_rbpf::memory_region::MemoryMapping::map increased. maybe some code shifting? (maybe @Lichtso can have some clue?)

Samples: 156K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 5549052704 lost: 0/0 drop: 0/0                                                                                                                                  
Overhead  Shared Object                                   Symbol                                                                                                                                                                         
  15.75%  libc-2.31.so                                    [.] __memmove_avx_unaligned_erms                                                       
   8.90%  solana-validator-master-accounts-cache-race-v2  [.] <&mut bincode::de::Deserializer<R,O> as serde::de::Deserializer>::deserialize_tuple
   7.89%  libc-2.31.so                                    [.] __memcmp_avx2_movbe                                                                
   5.88%  solana-validator-master-accounts-cache-race-v2  [.] solana_rbpf::memory_region::MemoryMapping::map
   4.58%  [kernel]                                        [k] __lock_text_start                             
   3.70%  libc-2.31.so                                    [.] _int_malloc                                                                                                                                  
   3.20%  libc-2.31.so                                    [.] __sched_yield                                                                                                                                
   3.11%  libc-2.31.so                                    [.] __memset_avx2_erms                                                                                                                           
   3.01%  solana-validator-master-accounts-cache-race-v2  [.] <<&mut bincode::de::Deserializer<R,O> as serde::de::Deserializer>::deserialize_tuple::Access<R,O> as serde::de::SeqAccess>::next_element_seed                          
   2.23%  solana-validator-master-accounts-cache-race-v2  [.] alloc::collections::btree::search::<impl alloc::collections::btree::node::NodeRef<BorrowType,K,V,alloc::collections::btree::node::marker::LeafOrInternal>>::search_tree
   2.05%  [kernel]                                        [k] __sched_text_start                                                                                                                                                     
   1.86%  solana-validator-master-accounts-cache-race-v2  [.] <crossbeam_epoch::sync::list::Iter<T,C> as core::iter::traits::iterator::Iterator>::next
   1.50%  solana-validator-master-accounts-cache-race-v2  [.] solana_rbpf::vm::EbpfVm<E,I>::execute_program_interpreted_inner                                                                      
   1.46%  libc-2.31.so                                    [.] malloc                                                                                                                               
   1.32%  solana-validator-master-accounts-cache-race-v2  [.] <serde::de::impls::<impl serde::de::Deserialize for alloc::vec::Vec<T>>::deserialize::VecVisitor<T> as serde::de::Visitor>::visit_seq
   1.20%  solana-validator-master-accounts-cache-race-v2  [.] bs58::encode::encode_into                                                                                                            
   0.83%  solana-validator-master-accounts-cache-race-v2  [.] crossbeam_epoch::default::pin                                                           
   0.75%  solana-validator-master-accounts-cache-race-v2  [.] solana_rbpf::ebpf::get_insn_unchecked                                                                                                                                      
   0.74%  [kernel]                                        [k] do_syscall_64                                                            
   0.65%  [kernel]                                        [k] finish_task_switch                                                                      
   0.65%  solana-validator-master-accounts-cache-race-v2  [.] <dashmap::DashMap<K,V,S> as dashmap::t::Map<K,V,S>>::_get                                                                                                                  
   0.63%  libc-2.31.so                                    [.] unlink_chunk.isra.0                                                                                                                                                        
   0.61%  libpthread-2.31.so                              [.] __pthread_rwlock_rdlock                                                                                                                                                    
   0.60%  solana-validator-master-accounts-cache-race-v2  [.] <&mut bincode::ser::Serializer<W,O> as serde::ser::Serializer>::serialize_newtype_struct                                                                                   
   0.59%  solana-validator-master-accounts-cache-race-v2  [.] crossbeam_deque::deque::Stealer<T>::steal                                                                                                                                  
   0.58%  libc-2.31.so                                    [.] _int_free                                                                                                                                                                  
   0.58%  libc-2.31.so                                    [.] cfree@GLIBC_2.2.5                                                                                                                                                          
   0.56%  solana-validator-master-accounts-cache-race-v2  [.] <std::collections::hash::map::DefaultHasher as core::hash::Hasher>::write                                                                                                  
   0.56%  [kernel]                                        [k] do_sched_yield                                                                                                                                                             
   0.53%  solana-validator-master-accounts-cache-race-v2  [.] solana_runtime::accounts_index::AccountsIndex<T>::latest_slot                                                                                                              
   0.48%  solana-validator-master-accounts-cache-race-v2  [.] solana_runtime::accounts_index::AccountsIndex<T>::get_account_read_entry                     
   0.42%  libc-2.31.so                                    [.] malloc_consolidate                                                                                                                                                         
   0.35%  solana-validator-master-accounts-cache-race-v2  [.] crossbeam_channel::waker::SyncWaker::notify                                                                                                                                
   0.35%  solana-validator-master-accounts-cache-race-v2  [.] <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold                                                                              
   0.35%  solana-validator-master-accounts-cache-race-v2  [.] solana_runtime::accounts_cache::AccountsCache::load                                                                                                                        
   0.34%  solana-validator-master-accounts-cache-race-v2  [.] rayon_core::registry::WorkerThread::wait_until_cold                                                                                                                        
   0.34%  solana-validator-master-accounts-cache-race-v2  [.] serde::de::SeqAccess::next_element                                                                                                                                         
   0.32%  solana-validator-master-accounts-cache-race-v2  [.] hashbrown::map::make_hash                                                                                                                                                  
   0.32%  solana-validator-master-accounts-cache-race-v2  [.] core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once                                                                                    
   0.31%  solana-validator-master-accounts-cache-race-v2  [.] solana_vote_program::vote_state::VoteState::process_vote                                                                             
   0.30%  solana-validator-master-accounts-cache-race-v2  [.] solana_runtime::accounts_db::LoadedAccount::take_account

jeffwashington · 2021-04-15T17:27:47Z

I suspect blake3_hash disappearing has at least 2 known causes:

@brooksprumo renamed it ;-)
this pr moved account hashing to the bg and only for the most recent store of an account. There are still a few places where hashing shows up in the metrics. I knew this pr improved experienced clock time. I hadn't seen it as a profiler line item disappear. This is cool.

Lichtso · 2021-04-15T17:40:24Z

I suspect execute_program_interpreted_inner went down as we switched to JIT being enabled by default.
Though, I have no idea why solana_rbpf::memory_region::MemoryMapping::map increased so much.
What were the versions (especially of solana_rbpf) when these measurements took place?

jeffwashington · 2021-04-15T19:07:16Z

   0.53%  [.] solana_runtime::accounts_index::AccountsIndex<T>::latest_slot

latest_slot is being attacked by an already-submitted pr described in this issue. @ryoqun ran into the assert gathering these metrics today and had to remove the pr that should have greatly reduced this %.

ryoqun · 2021-04-16T12:59:45Z

What were the versions (especially of solana_rbpf) when these measurements took place?

@Lichtso Hmm, unfortunately I don't have precise build info. I only compared between the tip of master around Jan 21 and one around Apr 16. This is very informal test so I don't want to make you worry too much. :) If you're interested, I can run this test between tip of master and v1.5.18. Also, the on-chain tx composition can change the perf metrics a lot between the time. So, I'm just looking this for fun and to ensure nothing so obviously bad is happening on tip of master. Anyway, I'm hoping your massive work #15410 will improve this perf numbers a lot. :)

Lichtso · 2021-04-16T13:42:16Z

@Lichtso Hmm, unfortunately I don't have precise build info. I only compared between the tip of master around Jan 21 and one around Apr 16. This is very informal test so I don't want to make you worry too much. :) If you're interested, I can run this test between tip of master and v1.5.18. Also, the on-chain tx composition can change the perf metrics a lot between the time. So, I'm just looking this for fun and to ensure nothing so obviously bad is happening on tip of master. Anyway, I'm hoping your massive work #15410 will improve this perf numbers a lot. :)

Thx, but now that I looked at the data again I think solana_rbpf::memory_region::MemoryMapping::map has not changed at all (in absolute terms). Everything else got better, so now it takes more of the cake in relative terms. If you take a look at the ratio of that and the next rank you will see what I mean:

2.40%  solana-validator    [.] solana_rbpf::memory_region::MemoryMapping::map
1.88%  [kernel]            [k] __lock_text_start
Ratio: 1.27659574468

5.88%  solana-validator-master-accounts-cache-race-v2  [.] solana_rbpf::memory_region::MemoryMapping::map
4.58%  [kernel]                                        [k] __lock_text_start
Ratio: 1.28384279476

Also, the refactoring I am working on shouldn't change much about the performance if at all.
However, it is a requirement for some of the planned more advanced optimizations.

Lichtso · 2021-07-23T17:49:43Z

To summarize how this progressed during the past half year:

Refactoring: Move KeyedAccounts to InvokeContext #15410 Concentrated the accounts metadata (KeyedAccount) used by the VM in once place (the InvokeContext)
Refactoring: Unify account_deps and accounts #17898 Got rid of the account_deps special cases and restricted the use of pre_accounts to validation only, in turn rendering them obsolete once accounts can be write-protected by memory mapping.
Bumps solana_rbpf to v0.2.14 #18869 The new constant-time-complexity address translation opens up a path for mapping accounts directly from the DB without copying them into a consecutive buffer first, by mapping each account to its own VM MemoryRegion.

Lichtso · 2021-07-30T18:44:44Z

@jackcmay, @jeffwashington, @aeyakovenko and I just had a recap discussion about all the related / connected issues and came up with the following design goals:

Differentiate new ABI / program entrypoint by new loader ID.
Must have: Bidirectional calling between new and old programs.
Nice to have: Extensible metadata in entrypoint, so fields can be added / removed more easily in the future.
Nice to have: Let native programs and BPF programs share a single entrypoint (Native and non-native programs implement two different entrypoints #14695).
Stable ABI using repr(C) and using correct mut and const attributes (Rust's AccountInfo contains immutable members that might be changed by the runtime or other programs #14692).
Respect Rust’s FFI rules when calling into/out of program (Rust BPF programs do not interface with the runtime via FFI-safe types #14693).
Redesign invoke syscall to only take the instruction, not accounts.
Remove use of RefCell, maybe an index (indirection) to account mapping.
Remove ser/de of accounts, map them directly using MemoryRegions in the VM (Program serialization is inefficient due to copies of accounts into and out of the program #14691).
- Meta data stays in input MemoryRegion, referencing accounts with translated pointers.
- Accounts go in a MemoryRegion each.
Remove preaccounts (copy and validation) because MemoryRegions can be marked read-only.
Make account data reallocatable / resizable without tricking the rust mechanisms (BPF programs have a limited ability to realloc account data sizes #14694).
- Currently needs to invoke system program to (re)allocate account.
- Instead add an explicit realloc instruction for resizing accounts (without extra copy).

About half of these ideas existed already and I linked the tracking IDs. I will try to come up with a concrete proposal for how the new interface should look like by next week and post it here as request for comments.

jackcmay · 2021-08-05T16:00:03Z

In addition, I think we can also address the following issue as well when we roll out the new loader that doesn't rely on serialization:
#9711

jackcmay added the research Open question / topic of discussion label Jan 10, 2021

Lichtso self-assigned this Jan 20, 2021

This was referenced Jan 20, 2021

Program serialization is inefficient due to copies of accounts into and out of the program #14691

Open

Rust BPF programs do not interface with the runtime via FFI-safe types #14693

Open

This was referenced Jan 20, 2021

Native and non-native programs implement two different entrypoints #14695

Open

Accounts incur a copy when passed to invoke that may not be neccissary #14696

Closed

ryoqun mentioned this issue Apr 5, 2021

reduce lock contention on latest_root #16306

Merged

mvines added this to the The Future! milestone May 10, 2021

Lichtso mentioned this issue Jul 23, 2021

Const integer bounds approach to bounds checking solana-labs/rbpf#194

Closed

dmakarov self-assigned this Aug 3, 2021

Lichtso mentioned this issue Aug 27, 2021

Index loaders / executable accounts #19469

Merged

Lichtso mentioned this issue Sep 11, 2021

ABI v2 Proposal #19191

Closed

pgarg66 added the runtime Issues related to runtime, BPF, and LLVM label Oct 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BPF program serialization/deserialization may no longer be necessary #14523

BPF program serialization/deserialization may no longer be necessary #14523

jackcmay commented Jan 10, 2021

Lichtso commented Jan 20, 2021

jackcmay commented Jan 20, 2021 •

edited

Loading

ryoqun commented Jan 23, 2021

jeffwashington commented Apr 5, 2021

sakridge commented Apr 5, 2021

ryoqun commented Apr 15, 2021

jeffwashington commented Apr 15, 2021

Lichtso commented Apr 15, 2021

jeffwashington commented Apr 15, 2021

ryoqun commented Apr 16, 2021 •

edited

Loading

Lichtso commented Apr 16, 2021

Lichtso commented Jul 23, 2021

Lichtso commented Jul 30, 2021

jackcmay commented Aug 5, 2021

BPF program serialization/deserialization may no longer be necessary #14523

BPF program serialization/deserialization may no longer be necessary #14523

Comments

jackcmay commented Jan 10, 2021

Lichtso commented Jan 20, 2021

jackcmay commented Jan 20, 2021 • edited Loading

ryoqun commented Jan 23, 2021

jeffwashington commented Apr 5, 2021

sakridge commented Apr 5, 2021

ryoqun commented Apr 15, 2021

jeffwashington commented Apr 15, 2021

Lichtso commented Apr 15, 2021

jeffwashington commented Apr 15, 2021

ryoqun commented Apr 16, 2021 • edited Loading

Lichtso commented Apr 16, 2021

Lichtso commented Jul 23, 2021

Lichtso commented Jul 30, 2021

jackcmay commented Aug 5, 2021

jackcmay commented Jan 20, 2021 •

edited

Loading

ryoqun commented Apr 16, 2021 •

edited

Loading