Switch programs activation to whole-set based gating #11750

jackcmay · 2020-08-20T21:34:48Z

Problem

get_entered_epoch_callback isn't called when restoring from snapshots, which is too confusing and error-prone.
Also, we can't simply call it immediately after snapshot restoration because it expects to be called exactly once at each epoch boundary

get_programs and get_builtins returns delta set. i.e. add these new additions of programs to the current available set at the given epoch. This works nicely in the ideal world, where we're running the validator since genesis without ever restarting a perfect bug-free validator.

In reality, we must rely on snapshots 99.999% of time. When restoring from snapshots, the delta set doesn't work quite: we don't persist the current available set (namely bank.message_processor is effectively serde(skip)).

Summary of Changes

So, just reflect the reality by making these functions snapshot-friendly by returning whole-set of available programs at the given epoch. And make it callable from finish_init(), which is called after snapshot restoration.

Also, fix a bunch of other dangerous code along the way.

Also, this is intended to be back-port friendly; so the fix is intentionally not exhaustive. Still, get_entered_epoch_callback is a bit error-prone. Specifically, it must be idempotent. (We could solve this by artificially introducing some intermediate struct like ScheduledBankFeatures or the like instead of mind-opening way of passing &mut Bank).

Deprecates #11736

jackcmay · 2020-08-20T22:44:41Z

@ryoqun I have some concerns with these changes, at every bank we are checking for the existence of the account and calling message_processor to override the program's process_instruction. We can probably at least separate the program activation mechanism from the generalized bank APIs to add built-ins.

Maybe genesis programs can have a list of (program, activation_epoch)s that is used by the epoch boundary callback to check if current_epoch == activation_epoch but provides a different callback for finish_init that does 'if current_epoch >= activation_epoch` in order to populate with all the activated programs.

Also, is there any reason not to make the hot fix permanent, seems the bank is the best place for that logic?

And, any reason to keep inflation in genesis-programs? Maybe we should also move that to runtime like the builtin programs are in runtime/src/builtins.rs?

codecov · 2020-08-20T23:09:28Z

Codecov Report

Merging #11750 into master will increase coverage by 0.0%.
The diff coverage is 86.8%.

@@           Coverage Diff            @@
##           master   #11750    +/-   ##
========================================
  Coverage    82.0%    82.0%            
========================================
  Files         330      330            
  Lines       77174    77464   +290     
========================================
+ Hits        63334    63589   +255     
- Misses      13840    13875    +35

ryoqun · 2020-08-21T02:12:54Z

runtime/src/bank.rs

+        if parent.epoch() < new.epoch() {
+            new.refresh_programs_and_inflation();
+        }


ryoqun · 2020-08-21T02:15:08Z

@ryoqun I have some concerns with these changes, at every bank we are checking for the existence of the account and calling message_processor to override the program's process_instruction.

I don't think we're doing at every bank. I think we're doing the checks only when crossing the epoch and when restoring from snapshots: https://github.com/solana-labs/solana/pull/11750/files#r474369115 and notice that finish_init isn't called from Bank::new_from_parent.

ryoqun · 2020-08-21T02:17:32Z

And, any reason to keep inflation in genesis-programs? Maybe we should also move that to runtime like the builtin programs are in runtime/src/builtins.rs?

Yeah, I'll move it as well. it's a bit odd for inflation to reside at the current place...

ryoqun · 2020-08-21T02:27:55Z

Maybe genesis programs can have a list of (program, activation_epoch)s that is used by the epoch boundary callback to check if current_epoch == activation_epoch but provides a different callback for finish_init that does 'if current_epoch >= activation_epoch` in order to populate with all the activated programs.

I think this approach will work too. But I originally avoided to do so because it'd make the activation mechanism a bit less flexible. On the other hand, this would lift the idempotent requirement which is currently required under refresh_programs_and_inflation. Anyway, I'll consider the approach a bit.

Also, note that I had this in mind: the most important objective here is make it super-simple to do add/tweak gating logic to avoid similar bugs in the foreseeable future.

FYI, we're doing the current checks at epoch boundaries already: #11750 (comment)

jackcmay · 2020-08-21T02:49:39Z

@ryoqun I have some concerns with these changes, at every bank we are checking for the existence of the account and calling message_processor to override the program's process_instruction.

I don't think we're doing at every bank. I think we're doing the checks only when crossing the epoch and when restoring from snapshots: https://github.com/solana-labs/solana/pull/11750/files#r474369115 and notice that finish_init isn't called from Bank::new_from_parent.

Yeah, sorry I didn't mean every bank, every epoch, but at every epoch we are re-adding these programs, kinda weird

jackcmay · 2020-08-21T02:52:10Z

Can you elaborate on how the (program, activation_epoch) would be less flexible?

ryoqun · 2020-08-21T03:24:08Z

Yeah, sorry I didn't mean every bank, every epoch, but at every epoch we are re-adding these programs, kinda weird

Yeah, I want to avoid the weird behavior as well, too. I originally thought that the behavior wouldn't occur because Bank::add_builtin_loader(Bank::add_builtin) ensures not to re-add accounts. But, I noticed that Bank::add_native_program does unconditionally add accounts...

I wonder why there is subtle behavior differences between the two similar functions. I'll just align Bank::add_native_program to Bank::add_builtin_loader.

Can you elaborate on how the (program, activation_epoch) would be less flexible?

It's less flexible in that you can't specify arbitrary condition like with closures. So, say we can't conditionally trigger new programs with bank.capitalization() like that. But, it's unlikely we'd ever need that flexibility.

Anyway, thanks for the inputs. Because now I have some more time, I'll put more changes to fix them more properly.

ryoqun · 2020-08-21T09:29:09Z

runtime/src/bank.rs

        *self.entered_epoch_callback.write().unwrap() = Some(entered_epoch_callback);
+        self.apply_feature_activations();


Sadly, this is needed in addition to call at finish_init.

ryoqun · 2020-08-21T09:32:49Z

runtime/src/bank.rs

        new.update_epoch_stakes(leader_schedule_epoch);
        new.ancestors.insert(new.slot(), 0);
        new.parents().iter().enumerate().for_each(|(i, p)| {
            new.ancestors.insert(p.slot(), i + 1);
        });
+        if parent.epoch() < new.epoch() {


This code is moved because of use of get_account inside there.

jackcmay · 2020-08-21T15:19:50Z

sdk/src/entrypoint_native.rs

@@ -175,6 +175,20 @@ pub type ProcessInstruction = fn(&Pubkey, &[KeyedAccount], &[u8]) -> Result<(),
 pub type ProcessInstructionWithContext =
    fn(&Pubkey, &[KeyedAccount], &[u8], &mut dyn InvokeContext) -> Result<(), InstructionError>;

+// These are just type aliases for work around of Debug-ing above function pointers


Does this mean these are temporary?

Yeah, I hope these are temporary. But, it'll take some time (a year?) considering the duration the upstream bug has been opening: rust-lang/rust#50280

If these are concerning, I'll try another workaround or remove workaround at all. I just want to dbg! some types containing these for quick debugging. This work-around isn't a must; just nice-to-have.

ryoqun · 2020-08-21T16:21:45Z

Well, the current (EDIT: base commit of this pr) code doesn't populate message_processor.loaders at all if restored from snapshots... I wonder why the current cluster is working..

master when booted from genesis (without snapshot) => 2 bpf loaders:

[runtime/src/bank.rs:587] &new.message_processor = MessageProcessor { programs: ["11111111111111111111111111111111: 0x563895613300", "Config1111111111111111111111111111111111111: 0x563895867a80", "Stake11111111111111111111111111111111111111: 0x5638958224e0", "Vote111111111111111111111111111111111111111: 0x56389584eb10"], loaders: ["BPFLoader2111111111111111111111111111111111: 0x5638952eabd0", "BPFLoader1111111111111111111111111111111111: 0x5638952eabd0"], native_loader: NativeLoader { program_symbol_cache: RwLock { data: {} }, loader_symbol_cache: RwLock { data: {} } }, is_cross_program_supported: true }

master when booted from snapshot (with snapshot) => no loaders:

[runtime/src/bank.rs:587] &new.message_processor = MessageProcessor { programs: ["11111111111111111111111111111111: 0x55dad6dc2300", "Config1111111111111111111111111111111111111: 0x55dad7016a80", "Stake11111111111111111111111111111111111111: 0x55dad6fd14e0", "Vote111111111111111111111111111111111111111: 0x55dad6ffdb10"], loaders: [], native_loader: NativeLoader { program_symbol_cache: RwLock { data: {} }, loader_symbol_cache: RwLock { data: {} } }, is_cross_program_supported: true }

Also, it seems that the message_processor.loaders is always empty regardless of snapshot or no snapshot on v1.2...

jackcmay · 2020-08-21T16:46:40Z

We need more snapshot tests ;-)

ryoqun · 2020-08-21T16:53:59Z

@jackcmay As for backporting this, if our ultimate goal is to land bpf_loader2 on v1.2, I think we're better off back-porting the prs first, not this one. I just noticed the amount of bpf2 is rather large. Back-porting in git-reversed order (= start to backport this pr now after merging) would be quite pain for this size of changes for me and you. ;)

If we can't backport individually those relevant prs due to known bug (maybe this?), we can create a pr against v1.2 and cherry-pick them in the git-chronological order, which will lessen the degree of merge conflicts.

ryoqun · 2020-08-21T16:56:03Z

Yeah, I'll move it as well. it's a bit odd for inflation to reside at the current place...

Also, I'll postpone this as a separate pr later. Inflation doesn't change on v1.2. And this will just make this pr (already moderate sized) bigger for no good reason.

runtime/src/bank.rs

jackcmay · 2020-08-21T17:49:39Z

runtime/src/bank.rs

+        self.recheck_cross_program_support();
+    }
+
+    fn recheck_cross_program_support(self: &mut Bank) {


why recheck and not just check?

not strong reason; just to convey the idempotent connotation. If this sounds odd, I'm completely fine with fn check_....

runtime/src/bank.rs

ryoqun · 2020-08-21T18:42:25Z

Ugh, I think I've found dos secrutiy... We aren't considering BPFLoader2111111111111111111111111111111111 is pre-occupied by malicious user....:

system accounts with some lamports (=> may cause capitalization off)
Stake1111111 (=> our cache invalidation is working correctly?)
Fake native loader account (=> are we always checking the account.owner before doing dlopen with its account.data?)

ryoqun · 2020-08-21T18:43:24Z

Anyway, I think we need sysvar-like treatment for existing accounts.

ryoqun · 2020-08-21T18:46:24Z

Also, it seems that the message_processor.loaders is always empty regardless of snapshot or no snapshot on v1.2...

I'm guessing now message_processor.loaders has been broken for some very long time. Only after #11516, the emptiness problem surfaced on master.

jackcmay · 2020-08-21T20:32:00Z

@ryoqun Can you elaborate more on empty loaders issue you are seeing?

runtime/src/bank.rs

ryoqun · 2020-08-21T21:45:37Z

@ryoqun Can you elaborate more on empty loaders issue you are seeing?

There are some cases where message_processor.loaders is empty. For example, on v1.2, it's always empty even if we're starting from genesis or snapshot. I think that field should hold bpf_loader at least... But it's working correctly in that validator can process program's transactions. Maybe, dynamic link libbpf_loader.so fallbacks?

For v1.3, I haven't tested.

For HEAD of master at the moment (ie without this pr), message_processor.loaders can be empty when restored from snapshot and causes panic when program's transactions are encountered. When running from genesis, message_processor.loaders isn't empty.

With this pr, message_process.loaders is always with bpf_loader and bpf_loader2. And validator can process program's transactions.

All, can be tested with ./run.sh.

ryoqun · 2020-08-21T21:51:35Z

oh conflict...

ryoqun · 2020-08-21T21:52:28Z

Maybe genesis programs can have a list of (program, activation_epoch)s ...

Ok, tests are in place, finally! I'll address this.

genesis-programs/src/lib.rs

jackcmay · 2020-08-24T17:19:13Z

runtime/src/bank.rs

+        if let Some(mut account) = self.get_account(&program_id) {
+            already_genuine_program_exists = native_loader::check_id(&account.owner);
+
+            if !already_genuine_program_exists {


Bring on the fire and brimstone.... what if this account has a very high balance and is obviously worth a lot to someone. Seems like instead or along with we should block known addresses from being added in the first place?

Also, what if someone adds an account and assigns it to the native_loader?

Bring on the fire and brimstone....

Hehe, I like thinking hard for corner cases... :)

Also, what if someone adds an account and assigns it to the native_loader?

Yeah, I thought of it. Well, can someone assign accounts to non-existing owner program_id (=NativeLoader1111111111111111111111111111111)?:

ryoqun@ubuqun:~/work/solana/solana$ ./target/release/solana --url http://api.mainnet-beta.solana.com account BPFLoader1111111111111111111111111111111111 Public Key: BPFLoader1111111111111111111111111111111111 Balance: 0.000000001 SOL Owner: NativeLoader1111111111111111111111111111111 Executable: true Rent Epoch: 0 Length: 25 (0x19) bytes 0000: 73 6f 6c 61 6e 61 5f 62 70 66 5f 6c 6f 61 64 65 solana_bpf_loade 0010: 72 5f 70 72 6f 67 72 61 6d r_program ryoqun@ubuqun:~/work/solana/solana$ ./target/release/solana --url http://api.mainnet-beta.solana.com account NativeLoader1111111111111111111111111111111 Error: AccountNotFound: pubkey=NativeLoader1111111111111111111111111111111

I just blindly thought the original code was covering the case that with mere this condition: native_loader::check_id(&account.owner).

If someone can do the wild, we can add account.executable condition to discern legitimate native loader program from the spoofed native loader program?

what if this account has a very high balance and is obviously worth a lot to someone. Seems like instead or along with we should block known addresses from being added in the first place?

;) Yeah, this could be quite a draconian. But, just to be sure, is this really protecting someone from misfortune? Well, people can give lump of sols into wrong account with 1 letter typo... Also, people can send to any of sysvars...

Anyway, I think we should need a preparatory pr to prohibit money transfer to & from some pretty long prefix or suffix? Considering new sysvar or new bpf loaders are rare, this might be a too much. Also, finding nice base58-friendly prefix/postfix seems hard....

Or, I'm also thinking to practice of 2-phased introduction of sysvars/loaders like this:
1(a). Release a patch release which just prohibits any money transfer to certain account at certain epoch.
1(b). Or, just create an account with Placeholder11111 with the executable bit? (might be easiest?)
1(c). Add BPFLoader11111 to frozen account.... (no, this is bad, dos vector)
2. release the acutual meat

jackcmay · 2020-08-24T17:20:09Z

runtime/src/bank.rs

-        self.store_account(program_id, &account);
+        let mut already_genuine_program_exists = false;
+        if let Some(mut account) = self.get_account(&program_id) {
+            already_genuine_program_exists = native_loader::check_id(&account.owner);


What does genuine mean here, could just be is_native

Good call. :) Not strongly-opinionated word choice, I wanted to emphasize that an native-loader's accounts can be good or bad. And this is really trying to indicate this variable holds a flag for the good native loader program's account.

I just picked this word from these opposing word pools:

real, legitimate, genuine <=> spoofed/crafted/malicious

ryoqun · 2020-08-24T19:03:59Z

@jackcmay Thanks for continuous review! I think I've addressed all of review comments. Could you finish off reviewing this so that I can merge this? :)

ryoqun · 2020-08-24T19:04:42Z

Maybe genesis programs can have a list of (program, activation_epoch)s ...

Ok, tests are in place, finally! I'll address this.

This is done: ebba5bd

genesis-programs/src/lib.rs

jackcmay · 2020-08-25T05:47:30Z

Besides that one last comment about the bank config, this looks great, thanks @ryoqun !

ryoqun · 2020-08-25T17:00:30Z

@jackcmay I've just merged this after somewhat extensive local-testing!

* Implement Debug for MessageProcessor * Switch from delta-based gating to whole-set gating * Remove dbg! * Fix clippy * Clippy * Add test * add loader to stable operating mode at proper epoch * refresh_programs_and_inflation after ancestor setup * Callback via snapshot; avoid account re-add; Debug * Fix test * Fix test and fix the past history * Make callback management stricter and cleaner * Fix test * Test overwrite and frozen for native programs * Test epoch callback with genesis-programs * Add assertions for parent bank * Add tests and some minor cleaning * Remove unsteady assertion... * Fix test... * Fix DOS * Skip ensuring account by dual (whole/delta) gating * Fix frozen abi implementation... * Move compute budget constatnt init back into bank Co-authored-by: Ryo Onodera <[email protected]> (cherry picked from commit db4bbb3) # Conflicts: # genesis-programs/src/lib.rs

) * Switch programs activation to whole-set based gating (#11750) * Implement Debug for MessageProcessor * Switch from delta-based gating to whole-set gating * Remove dbg! * Fix clippy * Clippy * Add test * add loader to stable operating mode at proper epoch * refresh_programs_and_inflation after ancestor setup * Callback via snapshot; avoid account re-add; Debug * Fix test * Fix test and fix the past history * Make callback management stricter and cleaner * Fix test * Test overwrite and frozen for native programs * Test epoch callback with genesis-programs * Add assertions for parent bank * Add tests and some minor cleaning * Remove unsteady assertion... * Fix test... * Fix DOS * Skip ensuring account by dual (whole/delta) gating * Fix frozen abi implementation... * Move compute budget constatnt init back into bank Co-authored-by: Ryo Onodera <[email protected]> (cherry picked from commit db4bbb3) # Conflicts: # genesis-programs/src/lib.rs * Fix conflicts Co-authored-by: Jack May <[email protected]> Co-authored-by: Ryo Onodera <[email protected]>

ryoqun · 2020-12-04T09:16:27Z

runtime/src/bank.rs

+
+        bank.add_builtin_program("mock_program1", vote_id, mock_ix_processor);
+        bank.add_builtin_program("mock_program2", stake_id, mock_ix_processor);
+        assert!(bank.stakes.read().unwrap().vote_accounts().is_empty());


well these assertions isn't enough.. https://github.com/solana-labs/solana/pull/13884/files#r535949045

jackcmay mentioned this pull request Aug 20, 2020

Switch programs activation to whole-set based gating #11736

Closed

ryoqun reviewed Aug 21, 2020

View reviewed changes

jackcmay commented Aug 21, 2020

View reviewed changes

ryoqun reviewed Aug 21, 2020

View reviewed changes

runtime/src/bank.rs Show resolved Hide resolved

jackcmay commented Aug 21, 2020

View reviewed changes

runtime/src/bank.rs Show resolved Hide resolved

jackcmay mentioned this pull request Aug 21, 2020

The constraints on compute power a program can consume is limited only to its instruction count #11717

Merged

ryoqun reviewed Aug 21, 2020

View reviewed changes

runtime/src/bank.rs Outdated Show resolved Hide resolved

ryoqun reviewed Aug 21, 2020

View reviewed changes

genesis-programs/src/lib.rs Show resolved Hide resolved

ryoqun added 3 commits August 24, 2020 14:31

Test epoch callback with genesis-programs

9160dc6

Add assertions for parent bank

7c26698

Add tests and some minor cleaning

5699303

ryoqun force-pushed the fragile-programs-gating branch from 04d8b80 to 5699303 Compare August 24, 2020 06:42

ryoqun added 3 commits August 24, 2020 16:31

Remove unsteady assertion...

9529fc4

Fix test...

19bee79

Fix DOS

c1ba3d2

jackcmay commented Aug 24, 2020

View reviewed changes

ryoqun added 2 commits August 25, 2020 02:48

Skip ensuring account by dual (whole/delta) gating

ebba5bd

Fix frozen abi implementation...

39201b8

jackcmay commented Aug 25, 2020

View reviewed changes

genesis-programs/src/lib.rs Outdated Show resolved Hide resolved

Move compute budget constatnt init back into bank

b6615fb

ryoqun merged commit db4bbb3 into master Aug 25, 2020

ryoqun deleted the fragile-programs-gating branch August 25, 2020 17:07

This was referenced Aug 25, 2020

v1.3: Upgrade loader ABI on testnet #11569

Closed

v1.3: Upgrade loader on mainnet-beta #11568

Closed

ryoqun added the v1.3 label Aug 25, 2020

mergify bot mentioned this pull request Aug 25, 2020

Switch programs activation to whole-set based gating (bp #11750) #11837

Merged

This was referenced Aug 31, 2020

Forbid bogus account assignment to native_loaders #11928

Closed

Use conventional special self notation #11990

Merged

ryoqun mentioned this pull request Dec 4, 2020

Simple cap by including sysvars and native programs #13884

Merged

ryoqun reviewed Dec 4, 2020

View reviewed changes

ryoqun mentioned this pull request Dec 10, 2020

Tiny add_native_program bug fixes with cleanups #14042

Merged

		*self.entered_epoch_callback.write().unwrap() = Some(entered_epoch_callback);
		self.apply_feature_activations();

Switch programs activation to whole-set based gating #11750

Switch programs activation to whole-set based gating #11750

Conversation

jackcmay commented Aug 20, 2020 • edited by ryoqun Loading

Problem

Summary of Changes

jackcmay commented Aug 20, 2020 • edited Loading

codecov bot commented Aug 20, 2020 • edited Loading

Codecov Report

Choose a reason for hiding this comment

ryoqun commented Aug 21, 2020

ryoqun commented Aug 21, 2020

ryoqun commented Aug 21, 2020

jackcmay commented Aug 21, 2020

jackcmay commented Aug 21, 2020

ryoqun commented Aug 21, 2020

Choose a reason for hiding this comment

ryoqun Aug 21, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ryoqun commented Aug 21, 2020 • edited Loading

jackcmay commented Aug 21, 2020

ryoqun commented Aug 21, 2020

ryoqun commented Aug 21, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ryoqun commented Aug 21, 2020

ryoqun commented Aug 21, 2020

ryoqun commented Aug 21, 2020

jackcmay commented Aug 21, 2020

ryoqun commented Aug 21, 2020

ryoqun commented Aug 21, 2020

ryoqun commented Aug 21, 2020

jackcmay Aug 24, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ryoqun Aug 24, 2020 • edited Loading

Choose a reason for hiding this comment

ryoqun commented Aug 24, 2020

ryoqun commented Aug 24, 2020

jackcmay commented Aug 25, 2020

ryoqun commented Aug 25, 2020

Choose a reason for hiding this comment

jackcmay commented Aug 20, 2020 •

edited by ryoqun

Loading

jackcmay commented Aug 20, 2020 •

edited

Loading

codecov bot commented Aug 20, 2020 •

edited

Loading

ryoqun Aug 21, 2020 •

edited

Loading

ryoqun commented Aug 21, 2020 •

edited

Loading

jackcmay Aug 24, 2020 •

edited

Loading

ryoqun Aug 24, 2020 •

edited

Loading