tmkms freeze #37
There have been reports of similar issues in the past (e.g. tendermint/tmkms#389); however, I have never been able to reproduce them. Can you see if you can reproduce the issue while running
I've started strace watching the tmkms process. It generates enormous log files! I'll watch it and avoid running out of disk space, and keep it running until the next halt. I am not aware of any way to trigger the failure, so it may take a while.
@mattharrop I've just released v0.7.3 which includes a number of bugfixes. Not sure any of them will help this issue per se, but there were some related to YubiHSM device (re-)enumeration which could potentially be the source of this problem.
And just now I finally, after three weeks of running under
I only need a little bit of recent context, maybe the last 10,000 lines or so?
Hi! We just ran into the same issue. It's our second time; after the first time, I think we just upgraded tmkms. @mattharrop which version did you run for the strace?
We managed to isolate a probable location where the problem is occurring. I'll add some more debug logging around this area, then cut a v0.7.4 release.
This is an attempt to help address #37. Based on `strace` logging it appears at least one of the instances of this bug occurred during a lock acquisition happening immediately after persisting the chain state. The system call sequence looked something like this:

```
close(12) = 0
rmdir("/.../.atomicwrite.InysUcmuRax7") = 0
futex(0x..., FUTEX_WAIT_PRIVATE, 2, NULL
```

Unfortunately this isn't a whole lot to go on, but it makes it appear as if it's hanging trying to acquire a lock immediately after persisting the consensus state to disk. This commit does a couple of things to try to narrow down what is happening:

1. Ensures that an exclusive lock to the chain state isn't held while the signing operation is being performed (i.e. while communicating with the HSM). If we were able to update the consensus state, that means the signing operation is authorized, and we no longer need to hold the lock. In the event the signing operation fails, the validator will miss the block in question, but with no risk of double-signing.
2. Adds a significant amount of additional debug logging, particularly around things like lock acquisition and writing to disk.

While this commit is unlikely to fix #37 in and of itself, the additional debug logging should be helpful in isolating the problem.
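For illustration, here is a minimal, self-contained Rust sketch of the locking pattern described in point 1. All of the names below (`ConsensusState`, `sign_with_hsm`, `handle_sign_request`) are invented for this example; this is not the actual tmkms code, just the shape of "update state under the lock, drop the guard, then sign":

```rust
use std::sync::RwLock;

// Illustrative stand-ins for the real chain state and HSM signer.
struct ConsensusState {
    height: u64,
}

struct SignError;

// Placeholder for the (potentially slow) round trip to the HSM.
fn sign_with_hsm(_msg: &[u8]) -> Result<Vec<u8>, SignError> {
    Ok(vec![0u8; 64])
}

fn handle_sign_request(
    state: &RwLock<ConsensusState>,
    height: u64,
    msg: &[u8],
) -> Result<Vec<u8>, SignError> {
    {
        // Hold the exclusive lock only while checking and updating state.
        let mut guard = state.write().expect("chain state lock poisoned");
        if height <= guard.height {
            return Err(SignError); // would regress state; refuse to sign
        }
        guard.height = height;
        // The write guard is dropped here, before talking to the HSM.
    }

    // If the HSM call fails now, the validator misses this block, but the
    // already-advanced state means there is no double-signing risk.
    sign_with_hsm(msg)
}

fn main() {
    let state = RwLock::new(ConsensusState { height: 0 });
    let _sig = handle_sign_request(&state, 1, b"vote");
}
```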
Looks like #60 triggered auto-close. It (likely) doesn't actually address the issue, but it should help debug it.
I've released v0.8.0-alpha1 which corrects a few minor things that might be going amiss, but also adds considerably more debug logging: https://forum.cosmos.network/t/ann-tendermint-kms-v0-8-tendermint-v0-33-support/3704

If you can reproduce this issue while running in verbose mode (i.e. with the

We're presently running this release in production.
Just FYI, we also experienced the same issue once, a couple of weeks ago, while running 0.7.2. It didn't occur again and as of 4 days ago we are on 0.8.0.
We haven't had any reports of this re-occurring since the v0.8 release, so I'm going to go ahead and close this. Please let me know if that changes.
These loglines were for helping debug #37. Now that it's closed, they can be removed as they are quite noisy.
Again last night :(
For anyone encountering this issue in production (we have never encountered it before, so that makes debugging that much harder), it would be extremely helpful if someone would attach to the process with something like gdb or lldb and obtain a backtrace (e.g. with the

That said, Tendermint v0.35 will include a gRPC-based privval connection between TMKMS and a validator, which will completely change the entire networking stack on both the TMKMS and Tendermint side. Perhaps that will indirectly fix the issue, or if not, it will at least help isolate the possible places it's occurring.
The same for me after a few weeks of uptime:
The server is dedicated to TMKMS
Right now I have automated restarting the process after 60 seconds without log output.
If anyone is experiencing something like this, especially frequently, it'd be very helpful to get a traffic capture with a tool like tcpdump. There are a number of things that could potentially cause this which are external to TMKMS, including it never receiving an incoming request. If you can use a tool like tcpdump to rule that out and show a case where TMKMS is actually receiving incoming network packets but never processing them, then that would be a TMKMS bug. Based on some other anecdotal reports from people who were experiencing issues like these, the cause was ultimately a network problem.
During the issue I was able to verify that the network connections worked (the destination IPs are all different, all reachable), and a simple restart of the TMKMS service fixed the problem.
If you can build TMKMS in debug mode (i.e. omit the
Same here, a simple restart fixes it. I've set tmkms to restart every two weeks regardless; it seems that the issue happens around every 3 weeks or so. Also, IMHO it is unlikely it is a network problem for us: the TMKMS VM is on the same physical host as the validator, they are VMs under the same hypervisor. @AC-Validator which method are you using to detect that logging stopped?
Mmm... It is quite intrusive for a production setup...
Analyzing the response latencies, it seems that the YubiHSM2 has no particular problems (it also depends on the blocking time of each chain). So I prefer active/passive clusters with this number of connections (~10).
@tony-iqlusion are you ready to let me do some tests?
Not quite yet... there's still some more work on the
Context: iqlusioninc/tmkms#37

To summarize, the `yubihsm::Client` type has internal locking for a `Session` with the YubiHSM. There is actually very little lock-related code associated with this scheme, but there appears to be a subtle reentrancy bug in pretty much the only function where it could possibly exist: the `yubihsm::Client::send_command` method.

Based on discussions on iqlusioninc/tmkms#37, the bug definitely seemed to be tied to rekeying the session after a preset number of messages had been sent. It appears there was a reentrancy bug where an RAII guard was still holding the `Mutex` for the session state, which prevented a subsequent attempt to resend the message from acquiring it. The solution is to explicitly drop the stale session guard prior to trying to establish a new session.
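For illustration, here is a minimal, self-contained Rust sketch of the kind of reentrancy hazard and fix described above. The types and numbers below (`Client`, `Session`, the 1000-message limit) are stand-ins invented for this example, not the actual yubihsm.rs code:

```rust
use std::sync::Mutex;

// Illustrative stand-ins; not the actual yubihsm.rs types.
struct Session {
    messages_sent: u32,
}

struct Client {
    session: Mutex<Option<Session>>,
}

impl Client {
    fn send_command(&self, _cmd: &[u8]) {
        let mut guard = self.session.lock().unwrap();

        // Decide whether the current session can still be used or whether
        // we have hit the message-volume limit and need to rekey.
        let needs_new_session = match guard.as_mut() {
            Some(session) if session.messages_sent < 1000 => {
                session.messages_sent += 1;
                false
            }
            _ => true, // no session yet, or message limit reached
        };

        if needs_new_session {
            // Without this explicit drop, open_session() below would try to
            // lock self.session while this thread still holds the guard: a
            // classic self-deadlock, since std's Mutex is not reentrant.
            drop(guard);
            self.open_session();
            // Retrying the command with the fresh session is elided here.
        }
    }

    fn open_session(&self) {
        let mut guard = self.session.lock().unwrap();
        *guard = Some(Session { messages_sent: 0 });
    }
}

fn main() {
    let client = Client {
        session: Mutex::new(None),
    };
    client.send_command(b"ping");
}
```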
I think I have a good guess as to what the core problem is here, and a prospective fix if my guess is right: I suspect it's a mutex reentrancy bug, where the same thread blocks itself attempting to acquire an already-held mutex. It occurs in the code which handles message volume limits and rekeying, which seemed to be implicated based on the above comments. If anyone is interested in extremely-alpha testing this, I can make a git branch you can build from. But that will be a little bit of work.
I'm curious why your setup never experienced this.
I've released TMKMS v0.11.0-pre.2 which includes the prospective fix for this issue. We're now running it in production.
We just experienced it for the first time recently. As to why we didn't experience it before I'm uncertain, but it's possible my guess as to the cause is wrong.
Happy to try it! Thanks Tony!
@tony-iqlusion we are in production, could you give an indication of whether it is safe?
The Minimum Supported Rust Version is now 1.56. You need to upgrade your rustc version.
Done, thanks. The reported version is still v0.10.1, right?
No, using

You need to do:
A quick update:
Going to go ahead and close this. If anyone experiences another freeze with tmkms v0.11.0-pre.2 or subsequent releases, please leave a comment and I'll reopen this issue.
FYI, we've released v0.11.0 containing a fix for this issue (#479).
The `Client` type previously had a `Drop` implementation that closed the session if the Client had an open one. This seems to have caused problems in other downstream projects and was subsequently removed: iqlusioninc/tmkms#37 iqlusioninc/yubihsm.rs#265 This is still a WIP as I'm hoping to find a way to close the Client's session w/o having to modify the yubihsm.rs code.
The `Client` type previously had a `Drop` implementation that closed the session if the Client had an open one. This seems to have caused problems in other downstream projects and was subsequently removed: iqlusioninc/tmkms#37 iqlusioninc/yubihsm.rs#265

The replacement was to provide a `session()` function that returns an Arc / MutexGuard wrapped reference to the optional session. This isn't useful for us here because we don't (and AFAIK can't) take ownership of the session, which we need because the Session::close function consumes the session (it can't be reopened). Our solution requires an upstream change to the `Client` type adding a `close_session` function that just closes the session if one is open.
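As a rough illustration, here is a self-contained sketch of what such an upstream `close_session` helper could look like. The field layout, error type, and method bodies below are assumptions made for this example, not the actual yubihsm.rs code:

```rust
use std::sync::{Arc, Mutex};

// Illustrative stand-in for the real session type: close() consumes it,
// so a closed session cannot be reused.
struct Session;

impl Session {
    fn close(self) -> Result<(), String> {
        Ok(())
    }
}

// Assumed layout, based on the commit message above: the optional session
// lives behind an Arc/Mutex, as exposed by the `session()` accessor.
struct Client {
    session: Arc<Mutex<Option<Session>>>,
}

impl Client {
    /// Close the current session, if one is open.
    fn close_session(&self) -> Result<(), String> {
        let mut guard = self.session.lock().map_err(|e| e.to_string())?;
        // take() moves the session out of the shared slot, giving us the
        // ownership that Session::close requires; the slot is left as None.
        if let Some(session) = guard.take() {
            session.close()?;
        }
        Ok(())
    }
}

fn main() {
    let client = Client {
        session: Arc::new(Mutex::new(Some(Session))),
    };
    client.close_session().expect("failed to close session");
}
```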
I'm having an issue with silent failures. tmkms will stop signing, with no errors logged. It just stops signing. The last log entry is the last signature. Restarting tmkms resolves the problem.
This is v0.7.2, on Ubuntu. I have this issue on two instances, with 2 keys and 3 keys in use on a YubiHSM respectively. The problem re-occurs irregularly; it seems to be roughly every few weeks.
At this point I have very little to go on, and no useful information to share.
Is there anything I can do to produce useful debugging information next time the failure happens?