2x Warehouse Nodes Stopped Replaying and Uploading Blocks But Remained Online #29901
Comments
I believe bigtable will only upload rooted slots. It is possible that banks were getting frozen but not rooted (i.e. a consensus divergence); can you check your logs or metrics to see whether your nodes were freezing banks?
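A quick way to check this, as a rough sketch — the log path and the "bank frozen" log line here are assumptions that may differ per deployment:

```sh
# Were banks still being frozen recently? (log path/format are assumptions)
grep "bank frozen" /home/sol/solana-validator.log | tail -n 5

# Compare with the node's view of rooted progress
solana --url http://127.0.0.1:8899 slot --commitment finalized
```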
It happened again on 5oXsXDXM8kmcK3KjSpurxB8DsJUqWCrSiD2kcLWEniSy. Logs: https://drive.google.com/file/d/1WVpB8pjVoUHg7WWlWdtd4uggE8SmE_fv/view?usp=share_link
Bigtable upload (times are Central Time, Jan 25 2023):
Last frozen bank:
Taking a quick look at your log, I would agree that a deadlock seems like a possibility. I think getting some stack traces would be helpful to figure out where things are stuck. The general process is to attach a debugger to the stuck process and dump backtraces for every thread into a log. With the log, you should be able to inspect the backtraces and see which threads are stuck.
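A minimal sketch of one way to capture those backtraces with gdb — the pgrep pattern and log filename are placeholders, not from the original thread:

```sh
# Attach to the running validator, dump every thread's backtrace to a file,
# then detach so the process resumes.
sudo gdb -p "$(pgrep -f solana-validator)" \
  -ex "set pagination off" \
  -ex "set logging file gdb-backtraces.log" \
  -ex "set logging on" \
  -ex "thread apply all bt" \
  -ex "set logging off" \
  -ex "detach" \
  -ex "quit"
```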
Trying to get gdb installed on a package-manager-less system is a much more daunting task than I would have imagined. In the meantime I downgraded one of the nodes to 1.13.6 and am seeing the same symptoms of no bigtable uploads.
We've seen issues similar to this and have found that removing the accounts, ledger, and snapshots directories and restarting while fetching a fresh snapshot will reestablish the bt connection + upload reliably. I was hoping this patch in 1.13.6 would fix the bt upload sync issue: #28728
Did your node completely lock up (i.e. stop replaying slots), or did it just stop uploading to bigtable while continuing to keep up with the tip of the cluster? The issue you linked sounds like a session issue (i.e. the latter), whereas this issue is exploring a more fundamental problem where the validator client grinds to a halt across the board.
That approach is a band-aid, and should really be a last resort. Did you ever try stopping/restarting your nodes along the way without wiping everything? If you're still encountering this, do you currently have any open issues and/or PRs? Additionally, if you're uploading to bigtable, suppose your node wipes state and restarts from a newly fetched snapshot: you now have a hole in your history between the last slot you uploaded and the snapshot slot.
Here's a gdb stack trace of our locked-up node:
Ah, I read too fast, apologies. We also have this as a separate issue. I'm referring to the one where the validator continues to sync but loses the bt session, which was fixed by the latest release. My validators have been syncing very inconsistently as of late.
Yeah, we have an alerter and an async backfill job that fills those gaps in with solana-ledger-tool.
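For reference, a sketch of what such a backfill invocation can look like — the ledger path and slot bounds are placeholders, and exact flags may differ between releases:

```sh
# Re-upload a missing slot range to bigtable from the local ledger.
# Requires GOOGLE_APPLICATION_CREDENTIALS pointing at a service account
# with write access to the bigtable instance.
solana-ledger-tool --ledger /path/to/ledger bigtable upload <start-slot> <end-slot>
```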
Adding to this thread that 1.13.5 ran into the issue; bumping down to 1.13.4.
Hi @segfaultdoc - the log you posted appears to be truncated a bit; I don't see all threads. Any chance you saved the output and have the full log available?
Here it is:
Thanks for the updated gdb output. It looks like some of the threads are still missing, but this second capture has output for more threads, which helps shed some light. It looks like some threads are stuck on this lock: solana/ledger/src/blockstore.rs, line 181 in d53c49c.
I see the write side of that lock being taken here: solana/core/src/ledger_cleanup_service.rs, line 240 in 46fde27.
Several threads are sitting at solana/ledger/src/blockstore.rs, line 2576 in 46fde27.
The full output could be helpful in narrowing down the search space, although there should only be a handful of threads that grab a writer on that lock, so we might be able to proceed without it as well.
Got back to this. It looks like we have a call path that will attempt to obtain a read lock a second time while it is already held. The call path is:
- solana/ledger/src/blockstore.rs, lines 1920 to 1932 in bfc29a6
- solana/ledger/src/blockstore.rs, lines 1934 to 2017 in bfc29a6
- solana/ledger/src/blockstore.rs, lines 2777 to 2780 in bfc29a6
- solana/ledger/src/blockstore.rs, lines 2784 to 2832 in bfc29a6
- solana/ledger/src/blockstore.rs, lines 2860 to 2882 in bfc29a6
So, if the cleanup service requests the write lock in between those two read acquisitions, the second read can block behind the pending writer while the writer is blocked behind the first read, and everything grinds to a halt.
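For illustration only (a standalone sketch, not the blockstore code): std::sync::RwLock makes no guarantee that a recursive read succeeds while a writer is queued, so a thread that re-reads a lock it already holds can deadlock once a writer parks in between:

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(RwLock::new(0u64));

    // Reader thread: takes a read guard, then reads again while still
    // holding the first guard (mirrors the nested call path above).
    let reader = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            let first = lock.read().unwrap();
            thread::sleep(Duration::from_millis(100)); // let the writer queue up
            let second = lock.read().unwrap(); // may block forever behind the writer
            println!("reader saw {} and {}", *first, *second);
        })
    };

    // Writer thread (plays the role of the cleanup service): requests the
    // write lock while the first read guard is still held.
    let writer = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            *lock.write().unwrap() += 1; // parks behind the first reader
        })
    };

    reader.join().unwrap();
    writer.join().unwrap();
}
```

Whether this actually hangs depends on the platform's rwlock fairness policy, which is part of why this kind of bug can stay hidden for a long time.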
@buffalu @segfaultdoc - The proposed fix is a little scary given that we're playing with locks, so we just want to be extra sure that there isn't anything else at play. Could you please confirm whether this warehouse node is doing anything else (e.g. serving RPC requests)? Additionally, can you comment on CPU utilization before/during the observed hang?
Problem
Our two warehouse nodes running v1.14.13-jito stopped processing slots last night. They were still online but weren't able to progress; one stopped around 3 AM CT and the other around 7:45 AM CT.
They also stopped uploading to bigtable.
Identities:
Proposed Solution
Determine why they stopped freezing slots and fix it.