
Network crash at 4 billion active stake milestone #3691

Open
joseluis-varo-wsg opened this issue Nov 18, 2024 · 1 comment
Problem

Hello,

We deployed a network from scratch with 4 validator nodes, starting at 5 SOL per node and delegating a billion SOL afterwards. Recently we reached 4 billion SOL of active stake (for comparison, Solana mainnet has yet to reach 400 million as of writing), and our network crashed. We can't boot it back up either; our last snapshot is from shortly before the epoch change and the subsequent 4-billion-stake milestone.

This is our error log:

[2024-11-14T01:26:25.974747961Z INFO  solana_runtime::bank] new epoch stakes, epoch: 119, total_stake: 4000000019999973760
thread '<unnamed>' panicked at /opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10:
called `Option::unwrap()` on a `None` value
(the same panic was emitted by three rayon worker threads at once, interleaved in the raw log)
stack backtrace:
[2024-11-14T01:26:26.057024521Z INFO  solana_metrics::metrics] datapoint: tpu-verifier recv_batches_us_90pct=0i recv_batches_us_min=0i recv_batches_us_max=0i recv_batches_us_mean=0i verify_batches_pp_us_90pct=0i verify_batches_pp_us_min=0i verify_batches_pp_us_max=0i verify_batches_pp_us_mean=0i discard_packets_pp_us_90pct=0i discard_packets_pp_us_min=0i discard_packets_pp_us_max=0i discard_packets_pp_us_mean=0i dedup_packets_pp_us_90pct=0i dedup_packets_pp_us_min=0i dedup_packets_pp_us_max=0i dedup_packets_pp_us_mean=0i batches_90pct=0i batches_min=0i batches_max=0i batches_mean=0i packets_90pct=0i packets_min=0i packets_max=0i packets_mean=0i num_deduper_saturations=0i total_batches=0i total_packets=0i total_dedup=0i total_excess_fail=0i total_valid_packets=0i total_discard_random=0i total_shrinks=0i total_dedup_time_us=0i total_discard_time_us=0i total_discard_random_time_us=0i total_verify_time_us=0i total_shrink_time_us=0i
[2024-11-14T01:26:26.104300166Z INFO  solana_metrics::metrics] datapoint: connection_cache_tpu_quic cache_hits=2i cache_misses=0i cache_evictions=0i eviction_time_ms=0i get_connection_ms=0i get_connection_lock_ms=0i get_connection_hit_ms=0i get_connection_miss_ms=0i make_connection_ms=0i total_connections=0i connection_reuse=2i connection_errors=0i zero_rtt_accepts=0i zero_rtt_rejects=0i congestion_events=0i streams_blocked_uni=0i data_blocked=0i acks=0i num_packets=2i total_batches=2i batch_failure=0i send_timeout=0i average_send_packet_us=0i successful_packets=0i average_prepare_connection_us=0i
[2024-11-14T01:26:26.229872130Z INFO  solana_metrics::metrics] datapoint: serve_repair-requests_received total_requests=0i dropped_requests_outbound_bandwidth=0i dropped_requests_load_shed=0i dropped_requests_low_stake=0i whitelisted_requests=0i total_dropped_response_packets=0i handle_requests_staked=0i handle_requests_unstaked=0i processed=0i total_response_packets=0i total_response_bytes_staked=0i total_response_bytes_unstaked=0i self_repair=0i window_index=0i request-highest-window-index=0i orphan=0i serve_repair-request-ancestor-hashes=0i pong=0i window_index_misses=0i ping_cache_check_failed=0i pings_sent=0i decode_time_us=0i handle_requests_time_us=0i err_time_skew=0i err_malformed=0i err_sig_verify=0i err_unsigned=0i err_id_mismatch=0i
   0: rust_begin_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
   2: core::panicking::panic
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:117:5
   3: solana_stake_program::stake_state::redeem_rewards
[2024-11-14T01:26:26.309572830Z INFO  solana_metrics::metrics] datapoint: ancestor_hashes_responses total_packets=0i processed=0i dropped_packets=0i invalid_packets=0i ping_count=0i ping_err_verify_count=0i
   4: solana_runtime::bank::Bank::redeem_rewards::{{closure}}::{{closure}}
   5: rayon::iter::plumbing::Folder::consume_iter
   6: rayon::iter::plumbing::bridge_producer_consumer::helper
[2024-11-14T01:26:26.334952317Z INFO  solana_core::window_service] num addresses: 0, top packets by source: []
[2024-11-14T01:26:26.334956735Z INFO  solana_metrics::metrics] datapoint: blockstore-insert-shreds num_shreds=0i total_elapsed_us=0i insert_lock_elapsed_us=0i insert_shreds_elapsed_us=0i shred_recovery_elapsed_us=0i chaining_elapsed_us=0i commit_working_sets_elapsed_us=0i write_batch_elapsed_us=0i num_inserted=0i num_repair=0i num_recovered=0i num_recovered_inserted=0i num_recovered_failed_sig=0i num_recovered_failed_invalid=0i num_recovered_exists=0i num_recovered_blockstore_error=0i num_repaired_data_shreds_exists=0i num_turbine_data_shreds_exists=0i num_data_shreds_invalid=0i num_data_shreds_blockstore_error=0i num_coding_shreds_exists=0i num_coding_shreds_invalid=0i num_coding_shreds_invalid_erasure_config=0i num_coding_shreds_inserted=0i
[2024-11-14T01:26:26.334964019Z INFO  solana_metrics::metrics] datapoint: recv-window-insert-shreds handle_packets_elapsed_us=0i run_insert_count=0i num_packets=0i num_repairs=0i num_shreds_received=0i shred_receiver_elapsed_us=0i prune_shreds_elapsed_us=0i num_shreds_pruned_invalid_repair=0i num_errors=15i num_errors_blockstore=0i num_errors_other=0i num_errors_try_crossbeam_send=0i num_errors_cross_beam_recv_timeout=15i
   7: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
   8: rayon_core::registry::WorkerThread::wait_until_cold
   9: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
   0: rust_begin_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
   2: core::panicking::panic
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:117:5
   3: solana_stake_program::stake_state::redeem_rewards
   4: solana_runtime::bank::Bank::redeem_rewards::{{closure}}::{{closure}}
   5: rayon::iter::plumbing::Folder::consume_iter
   6: rayon::iter::plumbing::bridge_producer_consumer::helper
   7: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
   8: rayon_core::registry::WorkerThread::wait_until_cold
[2024-11-14T01:26:26.396545175Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="?" one=1i message="panicked at /opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10:
    called `Option::unwrap()` on a `None` value" location="/opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10" version="1.17.27 (src:00000000; feat:3746964731, client:SolanaLabs)"
   9: rayon_core::join::join_context::{{closure}}
  10: rayon_core::registry::in_worker
  11: rayon::iter::plumbing::bridge_producer_consumer::helper
[2024-11-14T01:26:26.424312324Z INFO  solana_metrics::metrics] datapoint: memory-stats total=16768888832i swap_total=2148528128i free_percent=1.6338187386464036 used_bytes=9155133440i avail_percent=45.40405430722817 buffers_percent=0.4778983318624698 cached_percent=38.08877795057947 swap_free_percent=100
  12: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  13: rayon_core::registry::WorkerThread::wait_until_cold
  14: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
[2024-11-14T01:26:26.452182033Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="?" one=1i message="panicked at /opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10:
    called `Option::unwrap()` on a `None` value" location="/opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10" version="1.17.27 (src:00000000; feat:3746964731, client:SolanaLabs)"
   0: rust_begin_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
   2: core::panicking::panic
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:117:5
   3: solana_stake_program::stake_state::redeem_rewards
   4: solana_runtime::bank::Bank::redeem_rewards::{{closure}}::{{closure}}
   5: rayon::iter::plumbing::Folder::consume_iter
   6: rayon::iter::plumbing::bridge_producer_consumer::helper
   7: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
   8: rayon_core::registry::WorkerThread::wait_until_cold
   9: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2024-11-14T01:26:26.452475536Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="?" one=1i message="panicked at /opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10:
    called `Option::unwrap()` on a `None` value" location="/opt/bins/solana/1.17.27/solana-1.17.27/programs/stake/src/stake_state.rs:436:10" version="1.17.27 (src:00000000; feat:3746964731, client:SolanaLabs)"

Since we're seeing a `None` there right after a `checked_mul()`, and we're dealing with ...slightly oversized numbers, we suspect an integer overflow. Then again, u128 should leave ample room for that multiplication, so we're unsure how it could overflow.
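To illustrate the suspicion, here is a minimal standalone sketch of how a `checked_mul()` on u128 can still return `None` at this stake scale. The `stake_lamports` value is taken from the `total_stake` in the log above; the `credits` and `epoch_rewards` values are hypothetical placeholders, not the actual values from our cluster or the real stake-program code path:

```rust
fn main() {
    // Total stake from the log: ~4 billion SOL in lamports (1 SOL = 1e9 lamports).
    let stake_lamports: u128 = 4_000_000_019_999_973_760;
    // Hypothetical vote credits accumulated over the epoch (illustrative value).
    let credits: u128 = 1_000;
    // Stake points: stake * credits, ~4e21 here, which still fits in a u128.
    let points = stake_lamports.checked_mul(credits).unwrap();
    // Hypothetical epoch reward pool in lamports (illustrative value).
    let epoch_rewards: u128 = 400_000_000_000_000_000;
    // u128::MAX is ~3.4e38, but 4e21 * 4e17 = 1.6e39, so this checked_mul
    // returns None -- and an unwrap() on that None panics just like the log.
    let product = points.checked_mul(epoch_rewards);
    assert!(product.is_none());
    println!("points * rewards overflowed u128: {:?}", product);
}
```

So "ample room" in u128 runs out once both factors are on the order of 1e20: the headroom depends on the product of stake, credits, and rewards, not on any single term.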

Software version is v1.17.27, per the panic log above.

We had previously tracked down the affected line, expecting a fix in later versions, so for convenience's sake, here it is: it got moved to rewards.rs following a refactor, albeit with little change to the underlying logic, and we haven't seen any indication that this behavior varies across versions.

(Note: reopening this here after realizing the old repo is no longer in use, so the links above won't quite match)

Proposed Solution

None, I'm afraid.
