investigate system performance test degradation #17919

tao-stones · 2021-06-13T18:18:39Z

Problem

System test has poorer TPS with cost model

Summary of Changes

add new metrics around cost model operations
change Mutex<cost_tracker> to RwLock<cost_tracker>
remove the steps that cloned cost_tracker for local use, since the clone is expensive
acquire and hold locks for a batch of transactions, instead of doing so for each transaction
remove the redundant would_fit check from the cost_tracker update execution path
add calculate_cost_no_alloc to avoid a heap allocation of transaction_cost on each call; calculate_cost is in the hot path, so many heap allocations can quickly add up to be very expensive

All of the above are refactors; there is no logic change.
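The idea behind calculate_cost_no_alloc can be sketched as follows. This is a minimal illustration, not the PR's code: the TransactionCost fields are simplified (account keys are plain u64 values), the cost weights are placeholders, and the reset and fill helpers are hypothetical names introduced only for this example.

```rust
// Hypothetical, simplified stand-in for the PR's TransactionCost;
// the real type in core/src/cost_model.rs differs.
#[derive(Debug, Default)]
pub struct TransactionCost {
    pub writable_accounts: Vec<u64>, // account keys, simplified to u64
    pub account_access_cost: u64,
    pub execution_cost: u64,
}

impl TransactionCost {
    /// Pre-allocate room for the writable accounts once, up front.
    pub fn new_with_capacity(capacity: usize) -> Self {
        Self {
            writable_accounts: Vec::with_capacity(capacity),
            ..Self::default()
        }
    }

    /// Clear the contents but keep the already allocated capacity.
    pub fn reset(&mut self) {
        self.writable_accounts.clear();
        self.account_access_cost = 0;
        self.execution_cost = 0;
    }
}

/// Allocating variant: builds a fresh TransactionCost (and its Vec) per call.
pub fn calculate_cost(writable: &[u64]) -> TransactionCost {
    let mut cost = TransactionCost::default();
    fill(writable, &mut cost);
    cost
}

/// Non-allocating variant: overwrites a caller-owned buffer. No heap
/// allocation happens as long as the input fits in the buffer's capacity.
pub fn calculate_cost_no_alloc(writable: &[u64], out: &mut TransactionCost) {
    out.reset();
    fill(writable, out);
}

// Shared cost computation; the weights below are placeholders.
fn fill(writable: &[u64], out: &mut TransactionCost) {
    out.writable_accounts.extend_from_slice(writable);
    out.account_access_cost = writable.len() as u64 * 10;
    out.execution_cost = 100;
}
```

A caller in the hot path would create one buffer with new_with_capacity before the loop and pass it to calculate_cost_no_alloc for every transaction, so the Vec is allocated once per batch rather than once per transaction.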

Fixes #

codecov · 2021-06-13T22:13:58Z

Codecov Report

❗ No coverage uploaded for pull request base (master@8167827).
The diff coverage is 87.4%.

❗ Current head d8ddb79 differs from pull request most recent head 6239429. Consider uploading reports for the commit 6239429 to get more accurate results

@@            Coverage Diff            @@
##             master   #17919   +/-   ##
=========================================
  Coverage          ?    82.3%           
=========================================
  Files             ?      434           
  Lines             ?   121050           
  Branches          ?        0           
=========================================
  Hits              ?    99711           
  Misses            ?    21339           
  Partials          ?        0

behzadnouri

Can you please update the pull-request description? "Summary of changes" is confusing.

On a quick first look it seems like:

there are new metrics for more granular timings.
switching from Mutex to RwLock for cost_tracker.
new calculate_cost_no_alloc function.

Are there any nuanced logic changes as well?

tao-stones · 2021-06-21T22:40:14Z

Can you please update the pull-request description? "Summary of changes" is confusing.

On a quick first look it seems like:

there are new metrics for more granular timings.

switching from Mutex to RwLock for cost_tracker.

new calculate_cost_no_alloc function.

Are there any nuanced logic changes as well?

Thanks for pointing that out. I summarized the commit comments into the description to make it clearer.

behzadnouri · 2021-06-24T15:47:52Z

core/src/banking_stage.rs

+        let cost_model_readonly = cost_model.read().unwrap();
+        let mut cost_tracker_mutable = cost_tracker.write().unwrap();


I see a potential for deadlock here, should someone unknowingly lock these two in a different order somewhere else in the code. Can we mitigate this somehow, or at least document the order in which these should be locked?

tao-stones · 2021-06-25

Thanks for spotting it. Acquiring both locks at once was the result of moving the locking from inside the loop to outside it, which created the condition for a potential deadlock. There are a few options to break it, such as acquiring the read lock, doing all the calculations, then acquiring the write lock to write the costs; but that would require allocating a big chunk of heap. I need to test to see which option has an acceptable perf impact.
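The "read lock, calculate, then write lock" option mentioned in this reply could look roughly like the sketch below. It is only an illustration under simplifying assumptions: CostModel and CostTracker here are hypothetical stand-ins with a single numeric field each, and a transaction is reduced to an instruction count. The intermediate Vec is the extra heap allocation the reply refers to.

```rust
use std::sync::RwLock;

// Hypothetical, simplified stand-ins for the PR's CostModel and CostTracker.
pub struct CostModel {
    pub cost_per_instruction: u64,
}

impl CostModel {
    pub fn calculate_cost(&self, instruction_count: u64) -> u64 {
        instruction_count * self.cost_per_instruction
    }
}

#[derive(Default)]
pub struct CostTracker {
    pub block_cost: u64,
}

impl CostTracker {
    pub fn add_transaction(&mut self, cost: u64) {
        self.block_cost += cost;
    }
}

/// Never holds both locks at once, so the order in which other code
/// acquires them cannot produce a deadlock with this function.
pub fn track_batch(
    cost_model: &RwLock<CostModel>,
    cost_tracker: &RwLock<CostTracker>,
    batch: &[u64], // each entry stands for one transaction's instruction count
) {
    // Phase 1: hold only the read lock while computing all costs.
    let costs: Vec<u64> = {
        let model = cost_model.read().unwrap();
        batch.iter().map(|&n| model.calculate_cost(n)).collect()
    }; // the read guard is dropped here

    // Phase 2: hold only the write lock while recording the costs.
    let mut tracker = cost_tracker.write().unwrap();
    for cost in costs {
        tracker.add_transaction(cost);
    }
}
```

Each lock is still acquired once per batch rather than once per transaction; the price is the temporary costs buffer, which could itself be reused across batches if the allocation turns out to matter.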

tao-stones · 2021-06-27 (edited)

As a baseline, I reverted this block of code to acquire-use-release, instead of holding the locks, to avoid the potential deadlock:

 1  let mut cost_tracking_time = Measure::start("cost_tracking_time");
 2  let mut tx_cost = TransactionCost::new_with_capacity(MAX_WRITABLE_ACCOUNTS);
 3  {
 4      //let cost_model_readonly = cost_model.read().unwrap();
 5      //let mut cost_tracker_mutable = cost_tracker.write().unwrap();
 6      transactions.iter().enumerate().for_each(|(index, tx)| {
 7          if !unprocessed_tx_indexes.iter().any(|&i| i == index) {
 8              cost_model.read().unwrap().calculate_cost_no_alloc(&tx.transaction(), &mut tx_cost);
 9              cost_tracker.write().unwrap().add_transaction(
10                  &tx_cost.writable_accounts,
11                  &(tx_cost.account_access_cost + tx_cost.execution_cost),
12              );
13          }
14      });
15  }

I expected a noticeable performance drop, but surprisingly that wasn't the case. Before the change:
https://solanalabs.slack.com/archives/CP2L2S4KV/p1624691336048200
After the change: https://solanalabs.slack.com/archives/CP2L2S4KV/p1624759908048500
I don't know Rust's RwLock well enough to explain this; I just thought it was interesting to share.

Anyway, my actual intention is to move lines 8-12 to an additional thread in banking_stage: the four main threads would send each tx to this new thread via a channel, and that thread would execute lines 8-12 for every tx it receives. Do you have any concerns with this approach, @behzadnouri? Is adding another thread to banking_stage an issue?
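The proposed arrangement, with several worker threads sending to one dedicated cost-tracking thread over a channel, can be sketched with the standard library alone. Everything here is an assumption for illustration: CostTracker is a one-field stand-in, the message is a precomputed cost rather than a transaction, and spawn_cost_thread and demo are hypothetical names, not functions from banking_stage.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, RwLock};
use std::thread::{self, JoinHandle};

// Hypothetical, simplified stand-in for the PR's CostTracker.
#[derive(Default)]
pub struct CostTracker {
    pub block_cost: u64,
}

/// Spawn the dedicated thread. It is the only writer to the tracker, so the
/// worker threads never contend on the write lock themselves.
pub fn spawn_cost_thread(tracker: Arc<RwLock<CostTracker>>) -> (Sender<u64>, JoinHandle<()>) {
    let (sender, receiver) = channel::<u64>();
    let handle = thread::spawn(move || {
        // The loop ends once every Sender clone has been dropped.
        for cost in receiver {
            tracker.write().unwrap().block_cost += cost;
        }
    });
    (sender, handle)
}

/// Four workers (mirroring the four main threads mentioned above) each send
/// ten costs; returns the total recorded by the cost thread.
pub fn demo() -> u64 {
    let tracker = Arc::new(RwLock::new(CostTracker::default()));
    let (sender, handle) = spawn_cost_thread(Arc::clone(&tracker));

    let workers: Vec<JoinHandle<()>> = (0..4u64)
        .map(|id| {
            let sender = sender.clone();
            thread::spawn(move || {
                for _ in 0..10 {
                    sender.send(id + 1).unwrap();
                }
            })
        })
        .collect();

    // Drop the original sender so the receiver loop can terminate.
    drop(sender);
    for worker in workers {
        worker.join().unwrap();
    }
    handle.join().unwrap();

    let total = tracker.read().unwrap().block_cost;
    total
}
```

One consequence of this design is that the tracker is updated asynchronously, so a worker that reads the block cost right after sending may see a value that does not yet include its own transaction.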

behzadnouri · 2021-06-24T16:13:20Z

core/src/replay_stage.rs

+        );
+        update_cost_model_time.stop();
+
+        inc_new_counter_info!("replay_stage-update_cost_model", 1);
+        datapoint_info!(
+            "replay-loop-timing-stats",
+            (
+                "update_cost_model_elapsed",
+                update_cost_model_time.as_us() as i64,
+                i64
+            )
+        );


Please use ReplayTiming for these.

tao-stones · 2021-06-25

This one will be taken care of when my other PR is merged.
In PR #18123, I added a service thread to replay_stage to handle slow/expensive ops, such as everything related to the cost model. While doing that, I added ReplayServiceTiming to collect stats from that thread. Once it is merged, I will move this timing to ReplayServiceTiming.

behzadnouri · 2021-06-28

I would then suggest adding these metrics in the same PR, or once the other one is already merged in.

behzadnouri

some comments, but LGTM

behzadnouri · 2021-06-28T18:42:56Z

core/src/banking_stage.rs

+        inc_new_counter_info!(
+            "banking_stage-unprocessed_transactions",
+            unprocessed_tx_count
+        );


Can you please add this metric and the other one above to banking_stage_stats as well?

behzadnouri · 2021-06-28T18:46:02Z

core/src/banking_stage.rs

+            transactions.iter().enumerate().for_each(|(index, tx)| {
+                if !unprocessed_tx_indexes.iter().any(|&i| i == index) {


How big do we expect transactions and unprocessed_tx_indexes to be here?
Wouldn't this cause an O(n^2) performance cost?
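To make the concern concrete: the any() call scans unprocessed_tx_indexes once per transaction, so the filter costs O(n * m) for n transactions and m unprocessed indexes. One common alternative, shown here only as a sketch and not as what the PR adopted, is to collect the indexes into a HashSet first so each membership test is O(1) on average. Both function names below are hypothetical.

```rust
use std::collections::HashSet;

/// Mirrors the reviewed pattern: a linear scan of `unprocessed` for each index.
pub fn processed_indexes_quadratic(tx_count: usize, unprocessed: &[usize]) -> Vec<usize> {
    (0..tx_count)
        .filter(|index| !unprocessed.iter().any(|&i| i == *index))
        .collect()
}

/// Builds a set once, then does an average O(1) lookup per index.
pub fn processed_indexes_hashed(tx_count: usize, unprocessed: &[usize]) -> Vec<usize> {
    let unprocessed: HashSet<usize> = unprocessed.iter().copied().collect();
    (0..tx_count)
        .filter(|index| !unprocessed.contains(index))
        .collect()
}
```

For small batches the linear scan may well be faster in practice, since building the set allocates; which variant wins depends on the batch sizes the reviewer is asking about.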

behzadnouri · 2021-06-28T18:47:45Z

core/src/replay_stage.rs

+        );
+        update_cost_model_time.stop();
+
+        inc_new_counter_info!("replay_stage-update_cost_model", 1);
+        datapoint_info!(
+            "replay-loop-timing-stats",
+            (
+                "update_cost_model_elapsed",
+                update_cost_model_time.as_us() as i64,
+                i64
+            )
+        );


I would then suggest adding these metrics in the same PR, or once the other one is already merged in.

CriesofCarrots · 2021-06-28T22:24:12Z

@taozhu-chicago , you probably want to save yourself some time (and save the CI machine time) by rebasing on master before you retry local-cluster tests any more times: 639a61d

tao-stones · 2021-06-28T22:59:36Z

@taozhu-chicago , you probably want to save yourself some time (and save the CI machine time) by rebasing on master before you retry local-cluster tests any more times: 639a61d

Thanks!

* Add stats and counter around cost model ops, mainly:
  - calculate transaction cost
  - check transaction can fit in a block
  - update block cost tracker after transactions are added to block
  - replay_stage to update/insert execution cost to table
* Change mutex on cost_tracker to RwLock
* removed cloning cost_tracker for local use, as the metrics show clone is very expensive.
* acquire and hold locks for block of TXs, instead of acquire and release per transaction;
* remove redundant would_fit check from cost_tracker update execution path
* refactor cost checking with less frequent lock acquiring
* avoid many Transaction_cost heap allocation when calculate cost, which is in the hot path - executed per transaction.
* create hashmap with new_capacity to reduce runtime heap realloc.
* code review changes: categorize stats, replace explicit drop calls, concisely initiate to default
* address potential deadlock by acquiring locks one at time

(cherry picked from commit 9d6f1eb)

# Conflicts:
# core/benches/banking_stage.rs
# core/src/banking_stage.rs
# core/src/cost_model.rs
# core/src/cost_tracker.rs
# core/src/execute_cost_table.rs
# core/src/replay_stage.rs

* Cost Model to limit transactions which are not parallelizeable (#16694)
  * Add following to banking_stage:
    1. CostModel as immutable ref shared between threads, to provide estimated cost for transactions.
    2. CostTracker which is shared between threads, tracks transaction costs for each block.
  * replace hard coded program ID with id() calls
  * Add Account Access Cost as part of TransactionCost. Account Access cost are weighted differently between read and write, signed and non-signed.
  * Establish instruction_execution_cost_table, add function to update or insert instruction cost, unit tested. It is read-only for now; it allows Replay to insert realtime instruction execution costs to the table.
  * add test for cost_tracker atomically try_add operation, serves as safety guard for future changes
  * check cost against local copy of cost_tracker, return transactions that would exceed limit as unprocessed transaction to be buffered; only apply bank processed transactions cost to tracker;
  * bencher to new banking_stage with max cost limit to allow cost model being hit consistently during bench iterations
* replay stage feed back program cost (#17731)
  * replay stage feeds back realtime per-program execution cost to cost model;
  * program cost execution table is initialized into empty table, no longer populated with hardcoded numbers;
  * changed cost unit to microsecond, using value collected from mainnet;
  * add ExecuteCostTable with fixed capacity for security concern, when its limit is reached, programs with old age AND less occurrence will be pushed out to make room for new programs.
* investigate system performance test degradation (#17919)
  * Add stats and counter around cost model ops, mainly:
    - calculate transaction cost
    - check transaction can fit in a block
    - update block cost tracker after transactions are added to block
    - replay_stage to update/insert execution cost to table
  * Change mutex on cost_tracker to RwLock
  * removed cloning cost_tracker for local use, as the metrics show clone is very expensive.
  * acquire and hold locks for block of TXs, instead of acquire and release per transaction;
  * remove redundant would_fit check from cost_tracker update execution path
  * refactor cost checking with less frequent lock acquiring
  * avoid many Transaction_cost heap allocation when calculate cost, which is in the hot path - executed per transaction.
  * create hashmap with new_capacity to reduce runtime heap realloc.
  * code review changes: categorize stats, replace explicit drop calls, concisely initiate to default
  * address potential deadlock by acquiring locks one at time
* Persist cost table to blockstore (#18123)
  * Add `ProgramCosts` Column Family to blockstore, implement LedgerColumn; add `delete_cf` to Rocks
  * Add ProgramCosts to compaction excluding list alone side with TransactionStatusIndex in one place: `excludes_from_compaction()`
  * Write cost table to blockstore after `replay_stage` replayed active banks; add stats to measure persist time
  * Deletes program from `ProgramCosts` in blockstore when they are removed from cost_table in memory
  * Only try to persist to blockstore when cost_table is changed.
  * Restore cost table during validator startup
  * Offload `cost_model` related operations from replay main thread to dedicated service thread, add channel to send execute_timings between these threads;
  * Move `cost_update_service` to its own module; replay_stage is now decoupled from cost_model.
* log warning when channel send fails (#18391)
* Aggregate cost_model into cost_tracker (#18374)
  * aggregate cost_model into cost_tracker, decouple it from banking_stage to prevent accidental deadlock.
  * Simplified code, removed unused functions
  * review fixes
* update ledger tool to restore cost table from blockstore (#18489)
  * update ledger tool to restore cost model from blockstore when compute-slot-cost
  * Move initialize_cost_table into cost_model, so the function can be tested and shared between validator and ledger-tool
  * refactor and simplify a test
* manually fix merge conflicts
* Per-program id timings (#17554)
* more manual fixing
* solve a merge conflict
* featurize cost model
* more merge fix
* cost model uses compute_unit to replace microsecond as cost unit (#18934)
* Reject blocks for costs above the max block cost (#18994)
* Update block max cost limit to fix performance regession (#19276)
* replace function with const var for better readability (#19285)
* Add few more metrics data points (#19624)
* periodically report sigverify_stage stats (#19674)
* manual merge
* cost model nits (#18528)
* Accumulate consumed units (#18714)
* tx wide compute budget (#18631)
* more manual merge
* ignore zerorize drop security
* - update const cost values with data collected by #19627
  - update cost calculation to closely proposed fee schedule #16984
* add transaction cost histogram metrics (#20350)
* rebase to 1.7.15
* add tx count and thread id to stats (#20451)
  each stat reports and resets when slot changes
* remove cost_model feature_set
* ignore vote transactions from cost model

Co-authored-by: sakridge <[email protected]>
Co-authored-by: Jeff Biseda <[email protected]>
Co-authored-by: Jack May <[email protected]>

tao-stones force-pushed the system-perf-tests branch from 6965929 to d8ddb79 Compare June 19, 2021 01:32

tao-stones requested review from sakridge and behzadnouri June 21, 2021 14:16

tao-stones marked this pull request as ready for review June 21, 2021 14:16

behzadnouri reviewed Jun 21, 2021

tao-stones requested a review from behzadnouri June 21, 2021 22:40

behzadnouri reviewed Jun 24, 2021

tao-stones force-pushed the system-perf-tests branch from 8da39ac to 6239429 Compare June 28, 2021 14:42

behzadnouri previously approved these changes Jun 28, 2021

tao-stones added 10 commits June 28, 2021 18:02

503579b Add stats and counter around cost model ops, mainly:
        - calculate transaction cost
        - check transaction can fit in a block
        - update block cost tracker after transactions are added to block
        - replay_stage to update/insert execution cost to table
f2c6333 Change mutex on cost_tracker to rwlock
14e17b7 removed cloning cost_tracker for local use, as the metrics show clone is very expensive.
0c1cb3f acquire and hold locks for block of TXs, instead of acquire and release per transaction;
7e9fa7e remove redundant would_fit check from cost_tracker update execution path
        refactor cost checking with less frequent lock acquiring
94b6814 avoid many Transaction_cost heap allocation when calculate cost, which is in the hot path - executed per transaction.
9709692 create hashmap with new_capacity to reduce runtime heap realloc.
33ebbef concisely initiate to default
0884054 code review changes: categorize stats, replace explicit drop calls
cd8975a address potential deadlock by acquiring locks one at time

tao-stones force-pushed the system-perf-tests branch from 6239429 to cd8975a Compare June 28, 2021 23:03

tao-stones merged commit 9d6f1eb into solana-labs:master Jun 29, 2021

tao-stones added the v1.7 label Jul 15, 2021

mergify bot mentioned this pull request Jul 15, 2021

investigate system performance test degradation (backport #17919) #18695

Closed

tao-stones removed the v1.7 label Jul 15, 2021
