validator 1.13.5 keep crashing #29714

Closed
monster2048 opened this issue Jan 15, 2023 · 6 comments
Labels
community Community contribution

Comments

@monster2048

Problem

Version 1.13.5 runs very stably on a 5950X with 128 GB of memory, but it crashes roughly every 30 minutes on a 7950X with 128 GB of memory.
Debug info from just before the validator restarts:
/////////////////
[2023-01-15T03:11:15.085580810Z INFO solana_metrics::metrics] datapoint: receive_window_num_slot_shreds slot=172547972i num_shreds=16i
[2023-01-15T03:11:15.085585030Z INFO solana_metrics::metrics] datapoint: receive_window_num_slot_shreds slot=172547973i num_shreds=1380i
[2023-01-15T03:11:15.085588759Z INFO solana_metrics::metrics] datapoint: receive_window_num_slot_shreds slot=172547976i num_shreds=411i
[2023-01-15T03:11:15.085591889Z INFO solana_metrics::metrics] datapoint: receive_window_num_slot_shreds slot=172547974i num_shreds=2240i
[2023-01-15T03:11:15.085594889Z INFO solana_metrics::metrics] datapoint: receive_window_num_slot_shreds slot=172547975i num_shreds=1336i
[2023-01-15T03:11:15.085810367Z INFO solana_core::window_service] num addresses: 0, top packets by source: [(3.144.92.238:14701, 37), (5.62.126.196:8006, 36), (34.245.50.72:61555, 33), (23.109.172.92:8006, 33), (80.190.132.82:50006, 32)]
[2023-01-15T03:11:15.087663725Z ERROR solana_core::window_service] blockstore error: blockstore error
[2023-01-15T03:11:15.087710404Z ERROR solana_metrics::counter] COUNTER:{"name": "solana-window-insert-error", "counts": 176, "samples": 175, "now": 1673752275087, "events": 1}
[2023-01-15T03:11:15.087741054Z ERROR solana_core::window_service] thread Some("solana-window-insert") error Blockstore(RocksDb(Error { message: "Corruption: block checksum mismatch: stored = 1652295104, computed = 647228894, type = 1 in /mnt/ledger/main/rocksdb/000541.sst offset 36390996 size 4631" }))
[2023-01-15T03:11:15.089619402Z ERROR solana_core::window_service] blockstore error: blockstore error
[2023-01-15T03:11:15.089650361Z ERROR solana_metrics::counter] COUNTER:{"name": "solana-window-insert-error", "counts": 177, "samples": 176, "now": 1673752275089, "events": 1}
[2023-01-15T03:11:15.089662501Z ERROR solana_core::window_service] thread Some("solana-window-insert") error Blockstore(RocksDb(Error { message: "Corruption: block checksum mismatch: stored = 1652295104, computed = 647228894, type = 1 in /mnt/ledger/main/rocksdb/000541.sst offset 36390996 size 4631" }))
[2023-01-15T03:11:15.090799768Z INFO solana_metrics::metrics] datapoint: retransmit-first-shred slot=172547977i
[2023-01-15T03:11:15.091286662Z INFO solana_metrics::metrics] datapoint: slot_stats_tracking_complete slot=172546940i last_index=0i num_repaired=0i num_recovered=0i min_turbine_fec_set_count=0i is_full=false is_rooted=true is_dead=false
[2023-01-15T03:11:15.091554399Z INFO solana_metrics::metrics] datapoint: slot_stats_tracking_complete slot=172547865i last_index=777i num_repaired=0i num_recovered=315i min_turbine_fec_set_count=45i is_full=true is_rooted=false is_dead=false
[2023-01-15T03:11:15.091772306Z ERROR solana_core::window_service] blockstore error: blockstore error
[2023-01-15T03:11:15.091778926Z ERROR solana_metrics::counter] COUNTER:{"name": "solana-window-insert-error", "counts": 178, "samples": 177, "now": 1673752275091, "events": 1}
[2023-01-15T03:11:15.091783606Z ERROR solana_core::window_service] thread Some("solana-window-insert") error Blockstore(RocksDb(Error { message: "Corruption: block checksum mismatch: stored = 1652295104, computed = 647228894, type = 1 in /mnt/ledger/main/rocksdb/000541.sst offset 36390996 size 4631" }))
[2023-01-15T03:11:15.098072591Z INFO solana_metrics::metrics] datapoint: retransmit-first-shred slot=172547978i
[2023-01-15T03:11:15.109928371Z INFO solana_metrics::metrics] datapoint: retransmit-stage-slot-stats slot=172547212i outset_timestamp=1673751890379i elapsed_millis=653i num_shreds=2718i num_nodes=2773i num_shreds_received_root=0i num_shreds_received_1st_layer=0i num_shreds_received_2nd_layer=2718i num_shreds_sent_root=0i num_shreds_sent_1st_layer=0i num_shreds_sent_2nd_layer=2773i
[2023-01-15T03:11:15.109955510Z INFO solana_metrics::metrics] datapoint: retransmit-stage-slot-stats slot=172547213i outset_timestamp=1673751890764i elapsed_millis=632i num_shreds=1298i num_nodes=1405i num_shreds_received_root=0i num_shreds_received_1st_layer=0i num_shreds_received_2nd_layer=1298i num_shreds_sent_root=0i num_shreds_sent_1st_layer=0i num_shreds_sent_2nd_layer=1405i
[2023-01-15T03:11:15.109958470Z INFO solana_metrics::metrics] datapoint: cluster_nodes_retransmit num_nodes=3599i num_nodes_dead=379i num_nodes_staked=2495i num_nodes_stale=497i
[2023-01-15T03:11:15.109960600Z INFO solana_metrics::metrics] datapoint: retransmit-stage total_time=696135i epoch_fetch=9271i epoch_cache_update=2927i total_batches=502i num_nodes=4767i num_addrs_failed=0i num_shreds=4337i num_shreds_skipped=624i retransmit_total=26762i compute_turbine=1107312i unknown_shred_slot_leader=0i
////////////////////

Related info from syslog:
/////////////////////
Jan 15 03:11:18 ba systemd[1]: solana-runner.service: Main process exited, code=exited, status=1/FAILURE
Jan 15 03:11:18 ba systemd[1]: solana-runner.service: Failed with result 'exit-code'.
Jan 15 03:11:19 ba systemd[1]: solana-runner.service: Scheduled restart job, restart counter is at 8.
Jan 15 03:11:19 ba systemd[1]: Stopped Solana Validator.
Jan 15 03:11:19 ba systemd[1]: Started Solana Validator.
/////////////////////
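
The repeated corruption errors in the log above all point at the same SST file and offset. As a minimal sketch (the function name `summarize_corruption` is made up for illustration, and the pattern is written only against the message format shown in this excerpt), the following Python groups such errors by file and offset, which helps tell one bad block apart from widespread corruption:

```python
import re
from collections import Counter

# Matches the RocksDB corruption message as it appears in the excerpt above.
# Other RocksDB versions may word the message differently (assumption).
CORRUPTION_RE = re.compile(
    r"Corruption: block checksum mismatch: stored = (?P<stored>\d+), "
    r"computed = (?P<computed>\d+), type = \d+ in (?P<path>\S+) "
    r"offset (?P<offset>\d+) size (?P<size>\d+)"
)


def summarize_corruption(lines):
    """Count corruption errors per (SST file path, block offset)."""
    hits = Counter()
    for line in lines:
        match = CORRUPTION_RE.search(line)
        if match:
            hits[(match.group("path"), int(match.group("offset")))] += 1
    return hits
```

Passing the lines of the validator log file to `summarize_corruption` yields one counter entry per affected block. In the excerpt above, every error refers to offset 36390996 in 000541.sst.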

Proposed Solution

@monster2048 monster2048 added the community Community contribution label Jan 15, 2023
@monster2048
Author

monster2048 commented Jan 15, 2023

One difference I noticed: because single-core performance is much higher on the 7950X, the PoH module seems to keep the other cores much busier than on the 5950X. Usage on the other cores is about 50-60% on the 7950X versus 30-40% on the 5950X. Could racing file accesses be corrupting the DB?

Let me know if I need to collect more info.

Thanks.

@im-0
Contributor

im-0 commented Jan 15, 2023

A few obvious questions:

  1. Do you run official Solana binaries or build yourself from the source code?
  2. Do you use ECC or regular non-ECC RAM?
  3. Is your system overclocked*?
  4. Do you use latest versions of motherboard firmware?

* Note that some "desktop" motherboards, especially those targeted at gamers, overclock by default, and you need to disable some features in the BIOS settings to turn that off. In my case, instabilities on a 5950X were fixed by changing "Performance Enhancer" to "Default" and "Performance bias" to "None", if I remember correctly.

@monster2048
Author

monster2048 commented Jan 15, 2023

Thanks for getting back to me. Please see my answers below.
Q: Do you run official Solana binaries or build yourself from the source code?
A: Official build.
Q: Do you use ECC or regular non-ECC RAM?
A: Non-ECC.
Q: Is your system overclocked*?
A: Not intentionally, but I'll double-check that. Thanks.
Q: Do you use latest versions of motherboard firmware?
A: Yes.

One update: it does not happen every time, but I did see one kernel crash, and it was triggered by the validator process.
I'm trying to isolate the issue now. The first step is to replace the 2.5G NIC with a more common 1G NIC, to see whether the NIC causes the issue.

I guess this is not a validator bug but more likely a kernel bug. I'll close this thread later if that's confirmed. Thanks.

@im-0
Contributor

im-0 commented Jan 15, 2023

A: non-ECC

Try running memtest for some time, like a day or so if possible.

One update is not every time, but I did see one kernel crash, it's triggered by validator process.

Do you have any dmesg messages related to that kernel crash?
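
For anyone collecting that output, a rough sketch of filtering saved dmesg text for lines that commonly accompany hardware or kernel faults is shown below. The function name `kernel_fault_lines` and the keyword list are assumptions chosen for illustration, not an exhaustive or official set:

```python
import re

# Substrings that often appear in kernel logs around hardware faults or
# crashes. The list is illustrative only (assumption), not complete.
KERNEL_FAULT_RE = re.compile(
    r"machine check|\bmce:|hardware error|\bEDAC\b|\bOops\b|"
    r"kernel panic|general protection fault|\bBUG:",
    re.IGNORECASE,
)


def kernel_fault_lines(dmesg_text):
    """Return the lines of saved dmesg output that match a fault keyword."""
    return [
        line for line in dmesg_text.splitlines()
        if KERNEL_FAULT_RE.search(line)
    ]
```

The input would be text captured from `dmesg` (or from the kernel log after a reboot). An empty result does not rule out a hardware problem, since a hard crash may leave nothing behind.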

@steviez
Contributor

steviez commented Jan 15, 2023

@monster2048 - This issue has cropped up occasionally (see #9009); we started work on some tooling to help recover from this situation (see #26813) but:

  1. The set of tools is incomplete, so there would be a decent amount of manual processing
  2. The tooling first becomes available in v1.14

If this is a staked/voting validator that you want to get online again, this is one of the few scenarios where I would recommend wiping your ledger (specifically, the rocksdb directory) and starting fresh (remember to adjust your validator args if you destroy genesis.bin and/or snapshots and will need to redownload them). Your node's specs make me think that you're running a validator, and while this isn't great, it should be acceptable.

As for the issue itself, SST files are the RocksDB files that live on disk. The error indicates that one of the blocks in one of these files was corrupted. As mentioned, we don't see this much, but in the past it has seemingly been caused by some low-level fault (the OS or the hardware itself).

Once you remove the ledger, continuing to run into this error might indicate some issue with your setup.
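
As a hedged sketch of the recovery step described above: the snippet below renames the rocksdb directory instead of deleting it, which is a deviation from "wiping" and only makes sense if the disk has room to keep the old copy. The function name `set_aside_rocksdb` and the backup naming scheme are made up for illustration, and the validator must be stopped before running anything like this:

```python
import time
from pathlib import Path


def set_aside_rocksdb(ledger_dir, dry_run=True):
    """Rename <ledger_dir>/rocksdb to a timestamped backup name.

    With dry_run=True (the default) nothing is changed; the function only
    returns the path the directory would be renamed to.
    """
    src = Path(ledger_dir) / "rocksdb"
    if not src.is_dir():
        raise FileNotFoundError(f"no rocksdb directory under {ledger_dir}")
    # Hypothetical naming scheme: rocksdb.corrupt-<unix timestamp>
    dst = src.with_name(f"rocksdb.corrupt-{int(time.time())}")
    if not dry_run:
        src.rename(dst)
    return dst
```

With the path from the log in this issue, the call would be `set_aside_rocksdb("/mnt/ledger/main", dry_run=False)`. Once the validator is healthy again, the renamed directory can be deleted to reclaim space.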

@monster2048
Author

Thanks guys. I believe the issue was a memory frequency mismatch: the BIOS set it to 3600 MT/s by default, and after I changed it to 4800 MT/s the validator has been running for 24+ hours with no issues. Thanks again for the help, I'll close this bug now.
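
For readers who want to check for the same kind of mismatch, here is a rough sketch that compares the rated and configured speed of each DIMM from saved `dmidecode -t memory` output. It assumes the `Speed:` and `Configured Memory Speed:` field names and MT/s units printed by recent dmidecode versions (older versions use different labels), and the function name `memory_speeds` is made up for illustration:

```python
import re


def memory_speeds(dmidecode_text):
    """Return (rated, configured) speeds in MT/s for each populated DIMM."""
    pairs = []
    rated = None
    for raw in dmidecode_text.splitlines():
        line = raw.strip()
        if line.startswith("Memory Device"):
            rated = None  # a new DIMM section starts here
            continue
        match = re.match(r"Speed: (\d+) MT/s", line)
        if match:
            rated = int(match.group(1))
            continue
        match = re.match(r"Configured Memory Speed: (\d+) MT/s", line)
        if match and rated is not None:
            pairs.append((rated, int(match.group(1))))
            rated = None
    return pairs
```

A pair whose two values differ, such as `(4800, 3600)`, shows the BIOS running the memory at a different speed than the module reports, which is worth reviewing in the BIOS settings.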

3 participants