v1.16 nodes general protection fault #32940
Comments
The mce2 log contained 11_177_732 instances of this warning:
I've saved the log in
@Timoon21 Can you add the results of running `uname -srv` too, please?
This warning with this incidence rate typically indicates that the node deviated from cluster consensus. This node panicked not terribly long after:
Pre-panic, we see the corresponding message:
The node restarted shortly before:
And it looks like it finished initialization before it started dropping votes; init finished at 17:14:15 whereas the first vote was dropped at 17:15:56.
The node did output a debug file:
We could have compared this file to one generated from replaying the same slot.
As for the actual fault, the node restarted shortly before
And the fault happened here:
However, the node would not have started blockstore processing yet:
The threads in this pool are only accessed to perform transaction processing, within solana/ledger/src/blockstore_processor.rs (lines 90 to 96 in 0f41719).
Here is the only place this thread pool is used: solana/ledger/src/blockstore_processor.rs (lines 226 to 239 in 0f41719).
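For context, a minimal sketch of how a dedicated, named rayon thread pool of this kind can be declared; the pool size, constant name, and structure here are illustrative assumptions, not the contents of the lines referenced above:

```rust
use std::sync::LazyLock;
use rayon::ThreadPool;

// Illustrative only: a process-wide pool whose workers carry a recognizable
// name prefix so they show up as solBstoreProc00, solBstoreProc01, ... in
// backtraces, kernel fault logs, and tools like `top -H`.
static PAR_THREAD_POOL: LazyLock<ThreadPool> = LazyLock::new(|| {
    rayon::ThreadPoolBuilder::new()
        .num_threads(8) // illustrative; the real pool sizes itself from the machine
        .thread_name(|i| format!("solBstoreProc{i:02}"))
        .build()
        .expect("failed to build blockstore-processor thread pool")
});

fn main() {
    // Work submitted via install() runs on the named worker threads, which is
    // how transaction batches would get dispatched for parallel replay.
    PAR_THREAD_POOL.install(|| {
        // parallel transaction processing would happen here
    });
}
```

The thread-name prefix is what lets the faulting thread be identified as one of the `solBstoreProcXY` workers in the logs.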
Given that the node had not started processing blocks yet, these threads would not have had any work sent to them yet. So it would seem to me that some other thread corrupted things, which led to the fault in this thread.
wow, segvs :)
Unfortunately, I don't think we can reliably reproduce at the moment. Additionally, I think running with a debugger attached would slow the process down enough that the node would be unable to keep up, which would obfuscate debugging efforts. A core dump might be an option though; make sure core dumps are enabled and compile with something like this (this is what I have used for mem profiling):
I've had pretty good success reproducing this. I compiled a binary with debug symbols and expect to be able to repro in the next day or so. Once I have a core dump with debug symbols, it's GG.
Reposting from Discord -- I was able to replicate with the debug build, but it seems the program crashed inside of jemalloc. Am running it again so we have even more data. It's also interesting that in Timoon's Discord post the code crashes 3x at the same exact offset. It's possible we're corrupting jemalloc's memory in a way that manifests on that specific instruction; it seems to just be updating stats... one of those pointers seems to have gone bad.
I have the same offset too, and 384 GB of RAM.
Summarizing the latest from the Discord channel (https://discord.com/channels/428295358100013066/1146546152309801142):

- Sounds like we're leaning more towards a logic bug as opposed to a HW-level issue, given we're seeing crashes happen in the exact same place with the same backtrace (we'd expect random bit flips to result in more random symptoms). But we still see some system-level affinity -- e.g. Ben can repro easily while other operators have never seen this issue.
- Seems like an outsized number of Ubuntu 22.04 systems have seen the problem, but we've also observed it on 20.04 and across different kernel versions. We've observed it on both Labs and Jito clients. Only observed on v1.16.
- Unable to repro when replaying the same slot we previously died on, so it's not some straightforward logic bug.
- We've also seen a crash during startup before we've even started replaying blocks (still in a blockstore processor thread; still unpacking the snapshot) -- not clear if we can trust log timestamps to say this for sure.
- We are crashing inside of malloc. The layout is correct and the memory metrics look fine --> memory corruption. We crashed while updating jemalloc metrics. Seems likely jemalloc memory was corrupted by some other thread. No jemalloc changes between v1.14 and v1.16.
- With --enable-debug enabled, the variable / stack region that gets corrupted is "sopts" (static options) in the __malloc_default call. It's always that same region of memory that gets corrupted -- so definitely something deterministically writing out of bounds or similar. Current suspicion is something in the runtime.
- Alessandro has a speculative fix for possible heap corruption: alessandrod/rbpf@c9b2a6a
- Running multiple machines against MNB that have seen the GPF in an attempt to reproduce. Also asking Ben to run with Alessandro's speculative fix to see if he can still repro.
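To illustrate the class of bug suspected here (a generic, hypothetical example, not the actual Solana or rbpf code): a single out-of-bounds write through a raw pointer can silently clobber whatever the allocator placed next to the buffer, and the process then crashes much later, inside malloc/free, far from the code that did the writing.

```rust
fn main() {
    // Hypothetical illustration of heap corruption: allocate a small buffer...
    let mut buf = vec![0u8; 16];
    let ptr = buf.as_mut_ptr();

    unsafe {
        // ...then write one byte past its end. This is undefined behavior and
        // may overwrite allocator bookkeeping adjacent to the allocation.
        *ptr.add(16) = 0xff;
    }

    // Nothing fails at the write site; the damage only surfaces when the
    // allocator next touches the corrupted metadata (a later alloc/free),
    // which is why the backtrace points into jemalloc rather than at the
    // buggy writer.
    let _later: Vec<u8> = vec![0u8; 1024];
    drop(buf);
}
```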
fixed in 1.16.14 |
Several nodes have gotten a general protection fault (GPF) while running v1.16 on mainnet-beta. Debugging discussion is on Discord in the debug-gpf-1_16 channel. The faults are caused by one of the `solBstoreProcXY` threads in this thread pool; this pool is used solely for transaction replay. Here is a table tracking the occurrences:

mcb7 on 2023-07-03
https://discord.com/channels/428295358100013066/1027231858565586985/1125509913746083961
mce2 on 2023-08-22
mce2 was running 9d83bb2a when solana-validator received a SEGV at 2023-08-22 17:45:12 UTC.
/var/log/kern.log:
/var/log/apport.log:
`addr2line -e /home/sol/.local/share/solana/install/active_release/bin/solana-validator --functions --demangle 0x22e5000`
meyebro on 2023-09-11