storage: pick up upstream fix for Raft entries handling #28918
Running a 3-node TPCC cluster on roachprod. Under no load, one of the nodes panicked:

Comments
Oh wait, this was on @nvanbenschoten's and my sketchy count(*) branch.
Mind pointing to that branch? I don't think we're doing anything in that change that would affect a range's raft log, but I'd like to be sure. It will also be helpful for lining up the stack trace.
I captured a snapshot of the data directory on the first node which died. It's 71 gigabytes, uncompressed. I can upload it, but I'm assuming nobody will actually want to look at that.
@jordanlewis did anything strange happen on that node before this? Was it recently restarted? Was it about to run out of memory? This is on the …
Nothing weird happened on the node. It was the node I was running those COUNT queries on for the demo, but nothing besides that. The logs are available; let's follow up about this offline.
@tschottdorf how do these …
SGTM |
FYI your nodes are running slightly different binaries:
Note the different build time on n4. I realize the build SHA is the same, but I'm not super confident in our build SHA injection. Is it possible this is to blame?
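For context, build-metadata injection in Go is typically done by stamping package variables at link time; here is a minimal sketch under that assumption (hypothetical package and variable names, not CockroachDB's actual build package):

```go
package build

// These defaults are meant to be overwritten at link time, e.g.:
//
//	go build -ldflags "-X example.com/pkg/build.sha=$(git rev-parse HEAD) \
//	                   -X example.com/pkg/build.utcTime=$(date -u +%Y/%m/%d.%H:%M:%S)"
//
// If the -X flags don't match the variable paths exactly, the linker
// silently keeps the placeholders, which is how a binary can report a
// stale or identical SHA while the build time differs.
var (
	sha     = "unknown"
	utcTime = "unknown"
)

// SHA returns the stamped git revision, or "unknown" when unstamped.
func SHA() string { return sha }

// BuildTime returns the stamped UTC build time, or "unknown".
func BuildTime() string { return utcTime }
```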
I thought we'd removed most of the panics in …
@benesch, n4 isn't part of the cluster.
(To be clear, I was intending to use it as the load generator in some tests; I never started Cockroach on it or joined it to the cluster.)
Also, the "panic" I observed was a …
Oh good. By which I mean, oh no.
No worries—I think @nvanbenschoten was worried about other panics in the storage package. Of which there are many:
@nvanbenschoten …
This reproduced, on a different node, soon after startup. No longer on the count(*) branch, either.
Are there logs of this or a cluster that I can look at?
Ah, …
So first of all, jordan-tpcc:3 is actually node2. I mention this because I can't really say much about n1 (jordan-tpcc:1) -- the first entry in its logs for r604 is the crash itself! There's literally nothing else in them about that range, even though its logs go back as far as the other nodes'. But at the same time, n1 is running right now and the range is quiescent (though n2 is down), at lease index 40738. So whatever is happening seems to be able to rectify itself.

Now node2 (jordan-tpcc:3) has actually had the same problem:
It's interesting that it's apparently able to get a lease later. For r604, something that caught my eye was these rapid repeated lease acquisitions (or lease acquisition attempts). I don't know if that's related, but it definitely needs to be understood as well.
@jordanlewis please don't touch or restart n3, I'm out for a bit but will take a look at the raft log there.
Raft log shows that the last entry is an empty one at …
The log around the jump doesn't have a gap (as expected, or the other nodes would've also crashed).
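A gap check here is mechanical, since raft log entries must have strictly consecutive indexes; a minimal sketch of such a check (hypothetical helper, not actual CockroachDB code):

```go
package debug

import "go.etcd.io/etcd/raft/raftpb"

// contiguous reports whether a slice of raft log entries is free of
// index gaps, i.e. every entry's Index is exactly one above its
// predecessor's.
func contiguous(ents []raftpb.Entry) bool {
	for i := 1; i < len(ents); i++ {
		if ents[i].Index != ents[i-1].Index+1 {
			return false
		}
	}
	return true
}
```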
[deleted] because wrong thread. Wanted #28995.
As expected, I wasn't able to repro this by manually importing TPCC and running some …
but there's no indication of an ongoing schema change. I think this is likely the DROP operation, though the event log doesn't say anything about that not having completed.
upstream moved from github.com/etcd to go.etcd.io/etcd. We're going to have to pick up an update soon to fix cockroachdb#28918, so it's easier to switch now. We're not picking up any new commits except for the renames. Release note: None
29574: storage: update import path for Raft r=petermattis a=tschottdorf

upstream moved from github.com/etcd to go.etcd.io/etcd. We're going to have to pick up an update soon to fix #28918, so it's easier to switch now. We're not picking up any new commits except for the renames. Release note: None

29585: gossip: avoid allocation of UnresolvedAddr in getNodeIDAddressLocked r=nvanbenschoten a=nvanbenschoten

`getNodeIDAddressLocked` is called from `Dialer.ConnHealth` and `Dialer.DialInternalClient`. It was responsible for **1.71%** of all allocations (`alloc_objects`) on a 3-node long-running cluster that was running TPC-C 1K. Pointing into `nd.LocalityAddress` is safe because even if the `NodeDescriptor` itself is replaced in `Gossip`, the struct is never internally mutated. This is the same reason why taking the address of `nd.Address` was already safe.

Release note (performance improvement): Avoid allocation when checking RPC connection health.

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
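To illustrate the allocation pattern that gossip change avoids, here is a minimal sketch with hypothetical, simplified types (not the real descriptor definitions): the win comes from returning a pointer into the existing descriptor rather than copying the address into a fresh value on every call.

```go
package gossip

// Hypothetical, simplified stand-ins for the real types.
type UnresolvedAddr struct {
	NetworkField string
	AddressField string
}

type NodeDescriptor struct {
	Address UnresolvedAddr
}

// addrByCopy copies the address into a fresh value; the copy escapes
// to the heap on every call.
func addrByCopy(nd *NodeDescriptor) *UnresolvedAddr {
	addr := nd.Address
	return &addr
}

// addrByPointer returns a pointer into the descriptor itself, avoiding
// the allocation. Per the PR description, this is safe only because the
// descriptor is replaced wholesale in gossip and never mutated in place.
func addrByPointer(nd *NodeDescriptor) *UnresolvedAddr {
	return &nd.Address
}
```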
Downgrading, but leaving open. We also need roachtest coverage for this (I hope that will fall out of Nikhil's upcoming merge queue work to reproduce #29252).
This works around the bug outlined in etcd-io/etcd#10063 by matching Raft's internal implementation of commit pagination. Once the above PR lands, we can revert this commit (but I assume that it will take a little bit), and I think we should do that because the code hasn't gotten any nicer to look at. Fixes cockroachdb#28918. Release note: None
29579: storage: return one entry less in Entries r=petermattis a=tschottdorf

This works around the bug outlined in etcd-io/etcd#10063 by matching Raft's internal implementation of commit pagination. Once the above PR lands, we can revert this commit (but I assume that it will take a little bit), and I think we should do that because the code hasn't gotten any nicer to look at. Fixes #28918. Release note: None

29631: cli: handle merged range descriptors in debug keys r=petermattis a=tschottdorf

Noticed during #29252. Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
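A sketch of the pagination semantics being matched, under the assumption that raft keeps the first entry unconditionally and then accumulates further entries only while the running byte total stays within the limit (hypothetical helper name, not the actual CockroachDB code). Returning the entry that crosses the limit as well is the off-by-one this works around:

```go
package storage

import "go.etcd.io/etcd/raft/raftpb"

// truncateToMaxSize mimics the pagination semantics described above:
// the first entry is always returned, and each subsequent entry is kept
// only if the running byte total stays within maxSize. The entry whose
// addition would push the total past maxSize is dropped -- i.e. this
// returns "one entry less" than a naive implementation that includes
// the overflowing entry.
func truncateToMaxSize(ents []raftpb.Entry, maxSize uint64) []raftpb.Entry {
	if len(ents) == 0 {
		return ents
	}
	size := uint64(ents[0].Size())
	i := 1
	for ; i < len(ents); i++ {
		size += uint64(ents[i].Size())
		if size > maxSize {
			break // this entry would exceed maxSize; leave it out
		}
	}
	return ents[:i]
}
```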
29332: roachtest: tpcc with chaos r=benesch a=tschottdorf

This is an attempt to reproduce #28918. Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
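The chaos idea is straightforward to sketch generically (hypothetical harness hooks, not the actual roachtest API): run the TPC-C workload while a background loop repeatedly stops and restarts a random node.

```go
package chaos

import (
	"math/rand"
	"time"
)

// chaosLoop periodically stops a random node (1-based), waits out a
// downtime window, and restarts it, until done is closed. The stop and
// start hooks are supplied by the surrounding test harness.
func chaosLoop(numNodes int, stop, start func(node int), done <-chan struct{}) {
	for {
		select {
		case <-done:
			return
		case <-time.After(30 * time.Second): // time between failures
			n := rand.Intn(numNodes) + 1
			stop(n)
			time.Sleep(10 * time.Second) // downtime window
			start(n)
		}
	}
}
```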
@tbg can we close this issue?