
Reject blocks for costs above the max block cost #18994

Merged
merged 7 commits into solana-labs:master on Aug 12, 2021

Conversation

@tao-stones (Contributor) commented Jul 30, 2021

Problem

We don't want a block that is too large (say, one that takes more than 400ms to process).
While the producer limits block size on a best-effort basis using the cost model, the validator should reject a block as soon as it determines its cost exceeds the max limit, including blocks built with malicious intent.

Summary of Changes

added realtime cost-checking logic to reject blocks that would exceed the max limit:

  • define the max limits in block_cost_limits.rs; also removed the associated per-signer cost for now, as it is not well defined and makes very little difference
  • right after each batch's execution, accumulate its cost and check it against
    the limit, returning an error if the limit is exceeded
  • add feature gate_large_block for this change
  • moved cost const definitions into block_cost_limits.rs

(Note: this check was later removed without ever being activated: #29598)

Fixes #18944
Fixes #18993

@tao-stones (Contributor, Author) commented Jul 30, 2021

broke a coverage test ... looking into it ... it was due to an ABI change resulting from adding an additional TransactionError. Updated the ABI digest in bank.rs.

@tao-stones tao-stones marked this pull request as draft July 30, 2021 17:36
@tao-stones tao-stones marked this pull request as ready for review July 30, 2021 19:16
@codecov (bot) commented Jul 30, 2021

Codecov Report

Merging #18994 (af1befb) into master (fd93754) will decrease coverage by 0.0%.
The diff coverage is 84.7%.

@@            Coverage Diff            @@
##           master   #18994     +/-   ##
=========================================
- Coverage    82.9%    82.8%   -0.1%     
=========================================
  Files         450      454      +4     
  Lines      128336   128818    +482     
=========================================
+ Hits       106411   106766    +355     
- Misses      21925    22052    +127     

@tao-stones tao-stones requested a review from carllin August 3, 2021 19:34
@tao-stones tao-stones force-pushed the reject_too_large_blocks branch 2 times, most recently from aa858e6 to 8e89674 on August 4, 2021 23:30
@aeyakovenko (Member) commented

@taozhu-chicago @sakridge is this totally deterministic and based on some globally defined threshold?

@tao-stones (Contributor, Author) commented

> @taozhu-chicago @sakridge is this totally deterministic and based on some globally defined threshold?

the Max Block Limit is a fixed number, determined as the "worst case scenario", e.g. = worst_instruction_compute_unit * max_instruction_per_block, as observed on mainnet today.

@carllin (Contributor) commented Aug 5, 2021

> the Max Block Limit is a fixed number, determined as the "worst case scenario", eg. = worst_instruction_compute_unit * max_instruction_per_block, observed from mainnet today.

@taozhu-chicago, is the worst_instruction_compute_unit consistent across all the validators as well, i.e. they are all using the same cost function?

@tao-stones (Contributor, Author) commented

> the Max Block Limit is a fixed number, determined as the "worst case scenario", eg. = worst_instruction_compute_unit * max_instruction_per_block, observed from mainnet today.
>
> @taozhu-chicago, is the worst_instruction_compute_unit consistent across all the validators as well, i.e. they are all using the same cost function?

Yes, compute_unit is calculated by @jackcmay's tx_wide_compute_cap feature, at https://github.com/solana-labs/solana/blob/master/runtime/src/message_processor.rs#L1276-L1308
It should be consistent across nodes.

@@ -1949,6 +1948,12 @@ impl ReplayStage {
}
assert_eq!(*bank_slot, bank.slot());
if bank.is_complete() {
execute_timings.accumulate(&bank_progress.replay_stats.execute_timings);
@sakridge (Member) commented Aug 10, 2021

So if !bank.is_complete(), then it just sends completely empty execute_timings?

@tao-stones (Contributor, Author) replied

yea, it'd send empty execute_timings if no bank completed in this loop. However, on the receiving side, empty timings are ignored (https://github.com/solana-labs/solana/blob/master/core/src/cost_update_service.rs#L140-L143).

@sakridge (Member) replied

Why waste the cycles putting the thing in the channel and receiving it if it's just ignored, though? That kind of logic is also hard to follow, where extra work is done and nothing happens.

Can we not just send at the end of the block?

@tao-stones (Contributor, Author) replied

yea, fair enough. The original thought was to keep the flow uniform, at the cost of wasted cycles. I can add a small check at the end of the function before sending.

@tao-stones (Contributor, Author) commented Aug 10, 2021

in determining the compute-unit cost for signature and read/write account ops:

  1. say max block time = 400ms per block
  2. on average, an instruction takes 1ms; with 10X parallelism, max instructions per block = ( max block time / 1ms ) * 10 = 4,000
  3. max block cost (CU) = max instructions per block * max instruction cost (CU) = 4,000 * 200,000 = 800,000,000
  4. therefore the CU/us ratio = max block cost (CU) / ( 1,000 * max block time ) = 2,000
  • a signature takes 10us; with the CU/us ratio, that comes to 20K CU, which allows at most 40K signatures per block;
  • an account read takes 5us; with the CU/us ratio, that comes to 10K CU, which allows at most 80K reads per block;
  • an account write takes 25us; with the CU/us ratio, that comes to 50K CU, which allows at most 16K writes per block.

@tao-stones tao-stones force-pushed the reject_too_large_blocks branch from a3e08b4 to af1befb on August 10, 2021 21:49
@tao-stones tao-stones requested a review from sakridge August 11, 2021 01:11
@sakridge (Member) left a comment

lgtm

@tao-stones tao-stones merged commit 414d904 into solana-labs:master Aug 12, 2021
@behzadnouri (Contributor) commented

@taozhu-chicago There is a 4x regression in GCE 1 Hour perf stability - 5 nodes, 1 region, CPU only job, and git bisect is pointing to this change. Would you please take a look?

Also, this is the 2nd time the cost-model has caused regressions in that job, so it seems like benign-looking cost-model changes can easily introduce performance regressions. In the future, can you please run the perf-test before merging such changes?

@tao-stones (Contributor, Author) commented

> @taozhu-chicago There is a 4x regression in GCE 1 Hour perf stability - 5 nodes, 1 region, CPU only job, and git bisect is pointing to this change. Would you please take a look?
>
> Also this is the 2nd time cost-model has caused regressions in that job; So it seems like benign looking cost-model changes can easily introduce performance regressions. In the future, can you please run the perf-test before merging such changes?

@behzadnouri thanks for bringing it up. Will pay closer attention to the perf-test. I'll look into the regression now.

tao-stones added a commit to tao-stones/solana that referenced this pull request Sep 24, 2021
tao-stones added a commit to tao-stones/solana that referenced this pull request Sep 24, 2021
tao-stones added a commit to tao-stones/solana that referenced this pull request Sep 29, 2021
tao-stones added a commit to tao-stones/solana that referenced this pull request Sep 30, 2021
tao-stones added a commit to tao-stones/solana that referenced this pull request Oct 6, 2021
tao-stones added a commit that referenced this pull request Oct 6, 2021
* Cost Model to limit transactions which are not parallelizable (#16694)

* * Add following to banking_stage:
  1. CostModel as immutable ref shared between threads, to provide estimated cost for transactions.
  2. CostTracker which is shared between threads, tracks transaction costs for each block.

* replace hard coded program ID with id() calls

* Add Account Access Cost as part of TransactionCost. Account access costs are weighted differently between read and write, signed and non-signed.

* Establish instruction_execution_cost_table, add function to update or insert instruction cost, unit tested. It is read-only for now; it allows Replay to insert realtime instruction execution costs to the table.

* add test for cost_tracker atomically try_add operation, serves as safety guard for future changes

* check cost against local copy of cost_tracker, return transactions that would exceed limit as unprocessed transaction to be buffered; only apply bank processed transactions cost to tracker;

* bencher to new banking_stage with max cost limit to allow cost model being hit consistently during bench iterations

* replay stage feed back program cost (#17731)

* replay stage feeds back realtime per-program execution cost to cost model;

* program cost execution table is initialized into empty table, no longer populated with hardcoded numbers;

* changed cost unit to microsecond, using value collected from mainnet;

* add ExecuteCostTable with fixed capacity for security concern, when its limit is reached, programs with old age AND less occurrence will be pushed out to make room for new programs.

* investigate system performance test degradation  (#17919)

* Add stats and counter around cost model ops, mainly:
- calculate transaction cost
- check transaction can fit in a block
- update block cost tracker after transactions are added to block
- replay_stage to update/insert execution cost to table

* Change mutex on cost_tracker to RwLock

* removed cloning cost_tracker for local use, as the metrics show clone is very expensive.

* acquire and hold locks for block of TXs, instead of acquire and release per transaction;

* remove redundant would_fit check from cost_tracker update execution path

* refactor cost checking with less frequent lock acquiring

* avoid many Transaction_cost heap allocation when calculate cost, which
is in the hot path - executed per transaction.

* create hashmap with new_capacity to reduce runtime heap realloc.

* code review changes: categorize stats, replace explicit drop calls, concisely initiate to default

* address potential deadlock by acquiring locks one at a time

* Persist cost table to blockstore (#18123)

* Add `ProgramCosts` Column Family to blockstore, implement LedgerColumn; add `delete_cf` to Rocks
* Add ProgramCosts to the compaction exclusion list alongside TransactionStatusIndex in one place: `excludes_from_compaction()`

* Write cost table to blockstore after `replay_stage` replayed active banks; add stats to measure persist time
* Deletes program from `ProgramCosts` in blockstore when they are removed from cost_table in memory
* Only try to persist to blockstore when cost_table is changed.
* Restore cost table during validator startup

* Offload `cost_model` related operations from replay main thread to dedicated service thread, add channel to send execute_timings between these threads;
* Move `cost_update_service` to its own module; replay_stage is now decoupled from cost_model.

* log warning when channel send fails (#18391)

* Aggregate cost_model into cost_tracker (#18374)

* aggregate cost_model into cost_tracker, decouple it from banking_stage to prevent accidental deadlock; simplified code, removed unused functions

* review fixes

* update ledger tool to restore cost table from blockstore (#18489)

* update ledger tool to restore cost model from blockstore when compute-slot-cost

* Move initialize_cost_table into cost_model, so the function can be tested and shared between validator and ledger-tool

* refactor and simplify a test

* manually fix merge conflicts

* Per-program id timings (#17554)

* more manual fixing

* solve a merge conflict

* featurize cost model

* more merge fix

* cost model uses compute_unit to replace microsecond as cost unit
(#18934)

* Reject blocks for costs above the max block cost (#18994)

* Update block max cost limit to fix performance regression (#19276)

* replace function with const var for better readability (#19285)

* Add few more metrics data points (#19624)

* periodically report sigverify_stage stats (#19674)

* manual merge

* cost model nits (#18528)

* Accumulate consumed units (#18714)

* tx wide compute budget (#18631)

* more manual merge

* ignore zerorize drop security

* - update const cost values with data collected by #19627
- update cost calculation to closely proposed fee schedule #16984

* add transaction cost histogram metrics (#20350)

* rebase to 1.7.15

* add tx count and thread id to stats (#20451)
each stat reports and resets when slot changes

* remove cost_model feature_set

* ignore vote transactions from cost model

Co-authored-by: sakridge <[email protected]>
Co-authored-by: Jeff Biseda <[email protected]>
Co-authored-by: Jack May <[email protected]>
t-nelson pushed a commit that referenced this pull request Oct 6, 2021
(same squashed commit message as above)
Successfully merging this pull request may close these issues.

• Define all cost related const values in one place
• Reject blocks for costs above the max block cost