-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Panic in MarshalTo(0xc13312b400, 0xc1541c14ec, 0x13, 0x13, 0x15, 0x2, 0x11) #35803
Comments
@awoods187 Please always include
The portion in parentheses are the arguments to |
I added the version (it's today's master post the pr merge). I thought that'a what @tbg said here: #34241 (comment) |
@nvanbenschoten it looks like this SHA, regrettably, has your fix. Could you take a look? |
Yeah, I'm taking a look. I'm starting by adding back in the |
@awoods187 just saw this again in #35890 (comment). I added the I'm going to add the extra assertion back in to master and try to catch this using tpcc, Andy Woods style. |
35910: Revert "roachpb: clone Txn object in BatchResponse_Header.combine" r=nvanbenschoten a=nvanbenschoten This reverts the removal of the assertion in commit 05a93a6. The actual clone of the Txn object in `BatchResponse_Header.combine` is kept because that was again verified to fix _some_ issue in #35803 (comment). Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
We saw this fire again in #32135 (comment).
Unfortunately, we were only logging the new proto (size 195) because the extra protobuf marshalling present in Tobi's fork was too expensive to merge into master. The protobuf we do see looked like:
One interesting thing that we see from the stack trace is that this is a panic on the non-
I think the second point is key. We had been assuming that a BatchRequest that hit the local RPC fast-path (i.e. did not serialize its proto) could never add a brand new observed timestamp here. We made this assumption because a new transaction always starts with an observed timestamp for its local node. But this isn't true for leaf transactions. What this means is that it doesn't seem too crazy that we could be getting all the way to here on a local RPC: Lines 1767 to 1770 in 7b689f2
We also have a bit more evidence to support this. In #34241 (comment), we saw that
Take a look at which node crashed - node 4! The node whose observed timestamp was added between the original marshalling and the new marshalling of the BatchRequest. Conveniently, we actually already have the solution for this: #35762. I think we should leave the assertion in for a bit longer and also push to get that PR into 19.1. How do you feel about that @tbg? We could also restrict the backport to just the update to |
Let's do it, thanks for figuring this out! |
35762: roachpb: refine Transaction proto cloning r=nvanbenschoten a=nvanbenschoten Fixes #35803. This PR includes the final two commits from #35719. > By making Transaction.ObservedTimestamps immutable (which it almost already was), we can prohibit all interior mutability of references within Transaction, give it value semantics, and eliminate the distinction between "shallow" and "deep" object cloning. This reduces the cost of a clone to a single straightforward allocation and makes working with the object easier to think about. cc. @tbg Co-authored-by: Nathan VanBenschoten <[email protected]>
This debugging function was removed when we fixed cockroachdb#34241 and added back in when cockroachdb#35803 appeared because it was clear that we hadn't fully fixed the issue. It's been about 2 months since cockroachdb#35762 merged and we haven't seen any issues since, so this can now be removed. I don't think we meant to keep this in for the 19.1 release. We should backport this commit. Release note: None
37559: kv: remove withMarshalingDebugging function r=nvanbenschoten a=nvanbenschoten This debugging function was removed when we fixed #34241 and added back in when #35803 appeared because it was clear that we hadn't fully fixed the issue. It's been about 2 months since #35762 merged and we haven't seen any issues since, so this can now be removed. I don't think we meant to keep this in for the 19.1 release. We should backport this commit. Co-authored-by: Nathan VanBenschoten <[email protected]>
I hit this on a 7 node 36 cluster running tpc-c 10k on master v19.1.0-beta.20190304-458-g70e3468
I'm pretty sure it's different than #34241 because the portion after MarshalTo is different.cockroach.log
The text was updated successfully, but these errors were encountered: