Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BPF program serialization/deserialization may no longer be necessary #14523

Open
jackcmay opened this issue Jan 10, 2021 · 14 comments
Open

BPF program serialization/deserialization may no longer be necessary #14523

jackcmay opened this issue Jan 10, 2021 · 14 comments
Assignees
Labels
research Open question / topic of discussion runtime Issues related to runtime, BPF, and LLVM
Milestone

Comments

@jackcmay
Copy link
Contributor

Now that programs no longer support returning errors "gracefully" (#14500), the way data flows to and through a cross-program invocation call chain also changes and may no longer require serialization or extra copies.

There is no longer any roll-back or intermediate discarding of data. Account data now flows through the call chain linearly from one program to the next, either to the end of the chain upon success or up until an error where the data is discarded. Because of this, rather than copying data in and out of each program's environment, the data can be passed (and translated) via pointers. Each program in the call chain incrementally operates on the data, and the runtime verifies and snapshots in-between and at the end.

Doing so provides the following advantages:

  • No serialization or deserialization of program parameters. Instead, the parameters undergo a deep pointer translation into and out of the program's virtual address space.
  • No extra copying is required to update the caller's state between calls since the called program operates directly on the shared data. Translating the data types into and out of the program is a fast operation.
  • Could support a single entrypoint between native and BPF programs
  • Operating on shared data potentially makes it easier to support account data reallocations
  • In Rust, account data would be more naturally handled and understood by the compiler. Right now we have an issue where account data is modified by the runtime on data values the program thinks is immutable (account data slice size for example). Depending on what assumptions the compiler makes some updated values may be stale in registers upon return.

But comes with the following challenges:

  • Rust does not have a stable ABI so any types translated should be done on stable ABI types (repr(C)). We already have this issue when passing Instruction and AccountIInfo via Invoke. New repr(C) types should be used for all data passed into and out of the programs. This means the current native entrypoint (using KeyedAccounts is not well suited.
  • Reallocating account data from the program space may be possible but would require coordination between the program's memory allocator and the native allocator in order to translate and route allocation requests to the native allocator.
  • Potentially lots of churn in the runtime and sdk, for example, if we replace program parameters (native and BPF) with something that is repr(C) we should probably make that change all the way down to accounts so that we don't incur any extra copies of the account data then are absolutely necessary.

Overall I think this proposal does a lot for both performance, stability, and clarity of the data being passed to and between programs.

@jackcmay jackcmay added the research Open question / topic of discussion label Jan 10, 2021
@Lichtso Lichtso self-assigned this Jan 20, 2021
@Lichtso
Copy link
Contributor

Lichtso commented Jan 20, 2021

Here is the plan:

  1. Build a custom allocator for the Account struct (probably somewhere around here) which places all accounts loaded in a TransactionBatch in a consecutive memory range.
  2. Let each transaction create a BPF VM MemoryRegion from a slice of this accounts array into the BPF VM, so that only the accounts belonging to that transaction are inside the slice.
  3. Replace the account data copy steps by using the MemoryRegion, but keep the serialization interface.
  4. Then redesign the BPF program interface to get rid of the de-/serialization.
  • Fix the immutable problem
  • Allow for better realloc of account data
  • Unify entrypoints of BPF and native programs
  1. Refactor the Account struct to be repr(C) everywhere

@jackcmay
Copy link
Contributor Author

jackcmay commented Jan 20, 2021

@Lichtso Thanks for summarizing our discussion, I have broken out some of the action items described in this issue:
#14691
#14692
#14693
#14694
#14695
#14696

@ryoqun
Copy link
Member

ryoqun commented Jan 23, 2021

You might know this already. But mem copies in the transaction execution path is really hurting performance. I found this by casually perf-ing for reviewing #14596 .

Currently, supposedly large account data consumes more cpu cycles just for moving/copying them. And, interpreted bpf execution only consumes ~6%

Note that _blake3_hash_many_avx512 will be Account's DB problem..

From #14596 (comment):

$ sudo perf top -t $(echo $(ps -e -T | grep blockstore_proc | awk '{print $2}') | sed 's/ /,/g') --proc-map-timeout 10000

Samples: 165K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 14739498376 lost: 0/0 drop: 0/0
Overhead  Shared Object       Symbol
  42.74%  libc-2.31.so        [.] __memmove_avx_unaligned_erms
   8.30%  solana-validator    [.] _blake3_hash_many_avx512
   4.45%  solana-validator    [.] solana_rbpf::vm::EbpfVm<E,I>::execute_program_interpreted_inner
   3.85%  libc-2.31.so        [.] __sched_yield
   2.87%  libc-2.31.so        [.] _int_malloc
   2.40%  libc-2.31.so        [.] __memset_avx2_erms
   2.40%  solana-validator    [.] solana_rbpf::memory_region::MemoryMapping::map
   2.13%  solana-validator    [.] solana_rbpf::ebpf::get_insn_unchecked
   1.88%  [kernel]            [k] __lock_text_start
   1.85%  [kernel]            [k] __sched_text_start
   1.66%  [kernel]            [k] do_syscall_64
   1.58%  solana-validator    [.] <&mut bincode::de::Deserializer<R,O> as serde::de::Deserializer>::deserialize_tuple
   1.09%  solana-validator    [.] <crossbeam_epoch::sync::list::Iter<T,C> as core::iter::traits::iterator::Iterator>::next
   1.05%  solana-validator    [.] alloc::collections::btree::search::search_node
   0.94%  [kernel]            [k] finish_task_switch
   0.86%  libc-2.31.so        [.] malloc
   0.79%  solana-validator    [.] <<&mut bincode::de::Deserializer<R,O> as serde::de::Deserializer>::deserialize_tuple::Access<R,O> as serde::de::SeqAccess>::next_element_seed
   0.71%  libc-2.31.so        [.] __memcmp_avx2_movbe
   0.63%  solana-validator    [.] _blake3_compress_in_place_avx512
   0.61%  solana-validator    [.] bs58::encode::encode_into
   0.58%  solana-validator    [.] solana_runtime::accounts_index::AccountsIndex<T>::get_account_read_entry
   0.58%  libpthread-2.31.so  [.] __pthread_rwlock_rdlock
   0.55%  solana-validator    [.] bs58::encode::encode_into
   0.54%  solana-validator    [.] <solana_rbpf::elf::EBpfElf<E,I> as solana_rbpf::vm::Executable<E,I>>::lookup_bpf_function
   0.51%  libc-2.31.so        [.] malloc_consolidate
 

... So, I'm really looking forward to the improvements in this area. :)

@jeffwashington
Copy link
Contributor

I would hope this is largely addressed by the copy on write account data change that went in a few weeks ago. @ryoqun , if it is possible for you to run this again, that would be swell. or better, if you want to point me to how to run like this. Do you have to build differently? I've not yet run perf on our stack. @ryoqun

@sakridge
Copy link
Member

sakridge commented Apr 5, 2021

I think he used this to connect while it is already running:
sudo perf top -t $(echo $(ps -e -T | grep blockstore_proc | awk '{print $2}') | sed 's/ /,/g') --proc-map-timeout 10000

If you want to start recording from the beginning:

sudo perf record -F99 --call-graph dwarf target/release/solana-ledger-tool

or Intel:

sudo perf record -F99 --call-graph lbr target/release/solana-ledger-tool

can make a flame graph:
https://github.com/sakridge/unix-home/blob/master/bin/make-flame-graph.sh

Has a hardcoded path of https://github.com/brendangregg/FlameGraph in ~/src/FlameGraph

@ryoqun
Copy link
Member

ryoqun commented Apr 15, 2021

here's updated perf top, seeing progress. (kudos to @jeffwashington ) :) This is very casual check with recent master, comparing to this: #14523 (comment)

  • I'm seeing vastly reduced memmove (42% => 15%)
  • _blake3_hash_many_avx512 disappeared
  • execute_program_interpreted_inner reduced and solana_rbpf::memory_region::MemoryMapping::map increased. maybe some code shifting? (maybe @Lichtso can have some clue?)
Samples: 156K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 5549052704 lost: 0/0 drop: 0/0                                                                                                                                  
Overhead  Shared Object                                   Symbol                                                                                                                                                                         
  15.75%  libc-2.31.so                                    [.] __memmove_avx_unaligned_erms                                                       
   8.90%  solana-validator-master-accounts-cache-race-v2  [.] <&mut bincode::de::Deserializer<R,O> as serde::de::Deserializer>::deserialize_tuple
   7.89%  libc-2.31.so                                    [.] __memcmp_avx2_movbe                                                                
   5.88%  solana-validator-master-accounts-cache-race-v2  [.] solana_rbpf::memory_region::MemoryMapping::map
   4.58%  [kernel]                                        [k] __lock_text_start                             
   3.70%  libc-2.31.so                                    [.] _int_malloc                                                                                                                                  
   3.20%  libc-2.31.so                                    [.] __sched_yield                                                                                                                                
   3.11%  libc-2.31.so                                    [.] __memset_avx2_erms                                                                                                                           
   3.01%  solana-validator-master-accounts-cache-race-v2  [.] <<&mut bincode::de::Deserializer<R,O> as serde::de::Deserializer>::deserialize_tuple::Access<R,O> as serde::de::SeqAccess>::next_element_seed                          
   2.23%  solana-validator-master-accounts-cache-race-v2  [.] alloc::collections::btree::search::<impl alloc::collections::btree::node::NodeRef<BorrowType,K,V,alloc::collections::btree::node::marker::LeafOrInternal>>::search_tree
   2.05%  [kernel]                                        [k] __sched_text_start                                                                                                                                                     
   1.86%  solana-validator-master-accounts-cache-race-v2  [.] <crossbeam_epoch::sync::list::Iter<T,C> as core::iter::traits::iterator::Iterator>::next
   1.50%  solana-validator-master-accounts-cache-race-v2  [.] solana_rbpf::vm::EbpfVm<E,I>::execute_program_interpreted_inner                                                                      
   1.46%  libc-2.31.so                                    [.] malloc                                                                                                                               
   1.32%  solana-validator-master-accounts-cache-race-v2  [.] <serde::de::impls::<impl serde::de::Deserialize for alloc::vec::Vec<T>>::deserialize::VecVisitor<T> as serde::de::Visitor>::visit_seq
   1.20%  solana-validator-master-accounts-cache-race-v2  [.] bs58::encode::encode_into                                                                                                            
   0.83%  solana-validator-master-accounts-cache-race-v2  [.] crossbeam_epoch::default::pin                                                           
   0.75%  solana-validator-master-accounts-cache-race-v2  [.] solana_rbpf::ebpf::get_insn_unchecked                                                                                                                                      
   0.74%  [kernel]                                        [k] do_syscall_64                                                            
   0.65%  [kernel]                                        [k] finish_task_switch                                                                      
   0.65%  solana-validator-master-accounts-cache-race-v2  [.] <dashmap::DashMap<K,V,S> as dashmap::t::Map<K,V,S>>::_get                                                                                                                  
   0.63%  libc-2.31.so                                    [.] unlink_chunk.isra.0                                                                                                                                                        
   0.61%  libpthread-2.31.so                              [.] __pthread_rwlock_rdlock                                                                                                                                                    
   0.60%  solana-validator-master-accounts-cache-race-v2  [.] <&mut bincode::ser::Serializer<W,O> as serde::ser::Serializer>::serialize_newtype_struct                                                                                   
   0.59%  solana-validator-master-accounts-cache-race-v2  [.] crossbeam_deque::deque::Stealer<T>::steal                                                                                                                                  
   0.58%  libc-2.31.so                                    [.] _int_free                                                                                                                                                                  
   0.58%  libc-2.31.so                                    [.] cfree@GLIBC_2.2.5                                                                                                                                                          
   0.56%  solana-validator-master-accounts-cache-race-v2  [.] <std::collections::hash::map::DefaultHasher as core::hash::Hasher>::write                                                                                                  
   0.56%  [kernel]                                        [k] do_sched_yield                                                                                                                                                             
   0.53%  solana-validator-master-accounts-cache-race-v2  [.] solana_runtime::accounts_index::AccountsIndex<T>::latest_slot                                                                                                              
   0.48%  solana-validator-master-accounts-cache-race-v2  [.] solana_runtime::accounts_index::AccountsIndex<T>::get_account_read_entry                     
   0.42%  libc-2.31.so                                    [.] malloc_consolidate                                                                                                                                                         
   0.35%  solana-validator-master-accounts-cache-race-v2  [.] crossbeam_channel::waker::SyncWaker::notify                                                                                                                                
   0.35%  solana-validator-master-accounts-cache-race-v2  [.] <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold                                                                              
   0.35%  solana-validator-master-accounts-cache-race-v2  [.] solana_runtime::accounts_cache::AccountsCache::load                                                                                                                        
   0.34%  solana-validator-master-accounts-cache-race-v2  [.] rayon_core::registry::WorkerThread::wait_until_cold                                                                                                                        
   0.34%  solana-validator-master-accounts-cache-race-v2  [.] serde::de::SeqAccess::next_element                                                                                                                                         
   0.32%  solana-validator-master-accounts-cache-race-v2  [.] hashbrown::map::make_hash                                                                                                                                                  
   0.32%  solana-validator-master-accounts-cache-race-v2  [.] core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once                                                                                    
   0.31%  solana-validator-master-accounts-cache-race-v2  [.] solana_vote_program::vote_state::VoteState::process_vote                                                                             
   0.30%  solana-validator-master-accounts-cache-race-v2  [.] solana_runtime::accounts_db::LoadedAccount::take_account                               

@jeffwashington
Copy link
Contributor

I suspect blake3_hash disappearing has at least 2 known causes:

  1. @brooksprumo renamed it ;-)
  2. this pr moved account hashing to the bg and only for the most recent store of an account. There are still a few places where hashing shows up in the metrics. I knew this pr improved experienced clock time. I hadn't seen it as a profiler line item disappear. This is cool.

@Lichtso
Copy link
Contributor

Lichtso commented Apr 15, 2021

I suspect execute_program_interpreted_inner went down as we switched to JIT being enabled by default.
Though, I have no idea why solana_rbpf::memory_region::MemoryMapping::map increased so much.
What were the versions (especially of solana_rbpf) when these measurements took place?

@jeffwashington
Copy link
Contributor

   0.53%  [.] solana_runtime::accounts_index::AccountsIndex<T>::latest_slot                                                                                                              

latest_slot is being attacked by an already-submitted pr described in this issue. @ryoqun ran into the assert gathering these metrics today and had to remove the pr that should have greatly reduced this %.

@ryoqun
Copy link
Member

ryoqun commented Apr 16, 2021

What were the versions (especially of solana_rbpf) when these measurements took place?

@Lichtso Hmm, unfortunately I don't have precise build info. I only compared between the tip of master around Jan 21 and one around Apr 16. This is very informal test so I don't want to make you worry too much. :) If you're interested, I can run this test between tip of master and v1.5.18. Also, the on-chain tx composition can change the perf metrics a lot between the time. So, I'm just looking this for fun and to ensure nothing so obviously bad is happening on tip of master. Anyway, I'm hoping your massive work #15410 will improve this perf numbers a lot. :)

@Lichtso
Copy link
Contributor

Lichtso commented Apr 16, 2021

@Lichtso Hmm, unfortunately I don't have precise build info. I only compared between the tip of master around Jan 21 and one around Apr 16. This is very informal test so I don't want to make you worry too much. :) If you're interested, I can run this test between tip of master and v1.5.18. Also, the on-chain tx composition can change the perf metrics a lot between the time. So, I'm just looking this for fun and to ensure nothing so obviously bad is happening on tip of master. Anyway, I'm hoping your massive work #15410 will improve this perf numbers a lot. :)

Thx, but now that I looked at the data again I think solana_rbpf::memory_region::MemoryMapping::map has not changed at all (in absolute terms). Everything else got better, so now it takes more of the cake in relative terms. If you take a look at the ratio of that and the next rank you will see what I mean:

2.40%  solana-validator    [.] solana_rbpf::memory_region::MemoryMapping::map
1.88%  [kernel]            [k] __lock_text_start
Ratio: 1.27659574468

5.88%  solana-validator-master-accounts-cache-race-v2  [.] solana_rbpf::memory_region::MemoryMapping::map
4.58%  [kernel]                                        [k] __lock_text_start
Ratio: 1.28384279476

Also, the refactoring I am working on shouldn't change much about the performance if at all.
However, it is a requirement for some of the planned more advanced optimizations.

@Lichtso
Copy link
Contributor

Lichtso commented Jul 23, 2021

To summarize how this progressed during the past half year:

@Lichtso
Copy link
Contributor

Lichtso commented Jul 30, 2021

@jackcmay, @jeffwashington, @aeyakovenko and I just had a recap discussion about all the related / connected issues and came up with the following design goals:

About half of these ideas existed already and I linked the tracking IDs. I will try to come up with a concrete proposal for how the new interface should look like by next week and post it here as request for comments.

@dmakarov dmakarov self-assigned this Aug 3, 2021
@jackcmay
Copy link
Contributor Author

jackcmay commented Aug 5, 2021

In addition, I think we can also address the following issue as well when we roll out the new loader that doesn't rely on serialization:
#9711

@pgarg66 pgarg66 added the runtime Issues related to runtime, BPF, and LLVM label Oct 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
research Open question / topic of discussion runtime Issues related to runtime, BPF, and LLVM
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants