validator 1.13.5 keep crashing #29714
Comments
One difference I noticed: because single-core performance is much higher on the 7950X, the PoH module seems to keep the other cores much busier than on the 5950X. Usage on the other cores is about 50-60% on the 7950X, versus 30-40% on the 5950X. Could racing file accesses be corrupting the DB? Let me know if I need to collect more info. Thanks.
Few obvious questions:
* Note that some "desktop" motherboards, especially those targeted at gamers, enable overclocking by default, and you need to change some BIOS settings to disable it. In my case, instabilities on a 5950X were fixed by changing "Performance Enhancer" to "Default" and "Performance bias" to "None", if I remember correctly.
Thanks for getting back to me. Please see my comments below. One update: it doesn't happen every time, but I did see one kernel crash, and it was triggered by the validator process. I guess this is not a validator bug but more likely a kernel bug. I'll close this thread later if that's confirmed. Thanks.
Try running memtest for some time, like a day or so if possible.
Do you have any dmesg messages related to that kernel crash?
@monster2048 - This issue has cropped up occasionally (see #9009); we started work on some tooling to help recover from this situation (see #26813) but:
If this is a staked/voting validator that you want to get online again, this is one of the few scenarios where I would recommend wiping your ledger (specifically, the rocksdb directory) and starting fresh. Remember to adjust your validator args if you destroy genesis.bin and/or snapshots and need to redownload them. Your node's specs make me think that you're running a validator, and while wiping the ledger isn't great, it should be acceptable.
As for the issue itself, SST files are the ones that live on disk. The error indicates that one of the blocks in one of these files was corrupted. As mentioned, we don't see this much, but in the past it has seemingly been caused by some low-level fault (the OS or the HW itself). If you keep running into this error after removing the ledger, that might indicate an issue with your setup.
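To see whether the corruption is confined to one SST block or spread across several files, the validator log can be scanned for the RocksDB checksum message. The following is a minimal sketch, not part of the Solana tooling: the function name `find_corrupt_blocks` and the regular expression are assumptions, modeled only on the error text quoted in this issue.

```python
import re
from collections import Counter

# Assumed pattern, modeled on the RocksDB error text quoted in this issue.
CORRUPTION_RE = re.compile(
    r"block checksum mismatch: stored = (?P<stored>\d+), "
    r"computed = (?P<computed>\d+), type = \d+ "
    r"in (?P<path>\S+\.sst) offset (?P<offset>\d+) size (?P<size>\d+)"
)


def find_corrupt_blocks(lines):
    """Count checksum-mismatch reports per (sst path, offset, size).

    `lines` is any iterable of log lines, e.g. an open file object.
    """
    hits = Counter()
    for line in lines:
        match = CORRUPTION_RE.search(line)
        if match:
            key = (match["path"], int(match["offset"]), int(match["size"]))
            hits[key] += 1
    return hits
```

It could be used as `find_corrupt_blocks(open(log_path))`, where `log_path` is wherever your validator writes its log. A single repeated key points at one bad block; many distinct files would fit the suspicion of a low-level hardware or OS fault.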
Thanks, guys. I believe the issue was a memory frequency mismatch: the BIOS set it to 3600MT/s by default, and after I changed it to 4800MT/s, the validator has been running for 24+ hours with no issues. Thanks again for the help; I'll close this bug now.
Problem
1.13.5 runs very stably on a 5950X with 128G of memory, but it crashes about every 30 minutes on a 7950X with 128G of memory.
debug info before validator restart
/////////////////
[2023-01-15T03:11:15.085580810Z INFO solana_metrics::metrics] datapoint: receive_window_num_slot_shreds slot=172547972i num_shreds=16i
[2023-01-15T03:11:15.085585030Z INFO solana_metrics::metrics] datapoint: receive_window_num_slot_shreds slot=172547973i num_shreds=1380i
[2023-01-15T03:11:15.085588759Z INFO solana_metrics::metrics] datapoint: receive_window_num_slot_shreds slot=172547976i num_shreds=411i
[2023-01-15T03:11:15.085591889Z INFO solana_metrics::metrics] datapoint: receive_window_num_slot_shreds slot=172547974i num_shreds=2240i
[2023-01-15T03:11:15.085594889Z INFO solana_metrics::metrics] datapoint: receive_window_num_slot_shreds slot=172547975i num_shreds=1336i
[2023-01-15T03:11:15.085810367Z INFO solana_core::window_service] num addresses: 0, top packets by source: [(3.144.92.238:14701, 37), (5.62.126.196:8006, 36), (34.245.50.72:61555, 33), (23.109.172.92:8006, 33), (80.190.132.82:50006, 32)]
[2023-01-15T03:11:15.087663725Z ERROR solana_core::window_service] blockstore error: blockstore error
[2023-01-15T03:11:15.087710404Z ERROR solana_metrics::counter] COUNTER:{"name": "solana-window-insert-error", "counts": 176, "samples": 175, "now": 1673752275087, "events": 1}
[2023-01-15T03:11:15.087741054Z ERROR solana_core::window_service] thread Some("solana-window-insert") error Blockstore(RocksDb(Error { message: "Corruption: block checksum mismatch: stored = 1652295104, computed = 647228894, type = 1 in /mnt/ledger/main/rocksdb/000541.sst offset 36390996 size 4631" }))
[2023-01-15T03:11:15.089619402Z ERROR solana_core::window_service] blockstore error: blockstore error
[2023-01-15T03:11:15.089650361Z ERROR solana_metrics::counter] COUNTER:{"name": "solana-window-insert-error", "counts": 177, "samples": 176, "now": 1673752275089, "events": 1}
[2023-01-15T03:11:15.089662501Z ERROR solana_core::window_service] thread Some("solana-window-insert") error Blockstore(RocksDb(Error { message: "Corruption: block checksum mismatch: stored = 1652295104, computed = 647228894, type = 1 in /mnt/ledger/main/rocksdb/000541.sst offset 36390996 size 4631" }))
[2023-01-15T03:11:15.090799768Z INFO solana_metrics::metrics] datapoint: retransmit-first-shred slot=172547977i
[2023-01-15T03:11:15.091286662Z INFO solana_metrics::metrics] datapoint: slot_stats_tracking_complete slot=172546940i last_index=0i num_repaired=0i num_recovered=0i min_turbine_fec_set_count=0i is_full=false is_rooted=true is_dead=false
[2023-01-15T03:11:15.091554399Z INFO solana_metrics::metrics] datapoint: slot_stats_tracking_complete slot=172547865i last_index=777i num_repaired=0i num_recovered=315i min_turbine_fec_set_count=45i is_full=true is_rooted=false is_dead=false
[2023-01-15T03:11:15.091772306Z ERROR solana_core::window_service] blockstore error: blockstore error
[2023-01-15T03:11:15.091778926Z ERROR solana_metrics::counter] COUNTER:{"name": "solana-window-insert-error", "counts": 178, "samples": 177, "now": 1673752275091, "events": 1}
[2023-01-15T03:11:15.091783606Z ERROR solana_core::window_service] thread Some("solana-window-insert") error Blockstore(RocksDb(Error { message: "Corruption: block checksum mismatch: stored = 1652295104, computed = 647228894, type = 1 in /mnt/ledger/main/rocksdb/000541.sst offset 36390996 size 4631" }))
[2023-01-15T03:11:15.098072591Z INFO solana_metrics::metrics] datapoint: retransmit-first-shred slot=172547978i
[2023-01-15T03:11:15.109928371Z INFO solana_metrics::metrics] datapoint: retransmit-stage-slot-stats slot=172547212i outset_timestamp=1673751890379i elapsed_millis=653i num_shreds=2718i num_nodes=2773i num_shreds_received_root=0i num_shreds_received_1st_layer=0i num_shreds_received_2nd_layer=2718i num_shreds_sent_root=0i num_shreds_sent_1st_layer=0i num_shreds_sent_2nd_layer=2773i
[2023-01-15T03:11:15.109955510Z INFO solana_metrics::metrics] datapoint: retransmit-stage-slot-stats slot=172547213i outset_timestamp=1673751890764i elapsed_millis=632i num_shreds=1298i num_nodes=1405i num_shreds_received_root=0i num_shreds_received_1st_layer=0i num_shreds_received_2nd_layer=1298i num_shreds_sent_root=0i num_shreds_sent_1st_layer=0i num_shreds_sent_2nd_layer=1405i
[2023-01-15T03:11:15.109958470Z INFO solana_metrics::metrics] datapoint: cluster_nodes_retransmit num_nodes=3599i num_nodes_dead=379i num_nodes_staked=2495i num_nodes_stale=497i
[2023-01-15T03:11:15.109960600Z INFO solana_metrics::metrics] datapoint: retransmit-stage total_time=696135i epoch_fetch=9271i epoch_cache_update=2927i total_batches=502i num_nodes=4767i num_addrs_failed=0i num_shreds=4337i num_shreds_skipped=624i retransmit_total=26762i compute_turbine=1107312i unknown_shred_slot_leader=0i
////////////////////
related info from syslog
/////////////////////
Jan 15 03:11:18 ba systemd[1]: solana-runner.service: Main process exited, code=exited, status=1/FAILURE
Jan 15 03:11:18 ba systemd[1]: solana-runner.service: Failed with result 'exit-code'.
Jan 15 03:11:19 ba systemd[1]: solana-runner.service: Scheduled restart job, restart counter is at 8.
Jan 15 03:11:19 ba systemd[1]: Stopped Solana Validator.
Jan 15 03:11:19 ba systemd[1]: Started Solana Validator.
/////////////////////
Proposed Solution