Panic runtime error: index out of range when running 6 node tpc-c without range merging #34241
Comments
I extended this so it'd be up tomorrow.
Isn't this the same as the other one you filed?
Not the same, note the different receiver on …

The operation that's carried out here is

cockroach/pkg/roachpb/api.pb.go, Lines 14522 to 14530 in 3ac97e6

which in particular means that … The panic then happens on this line

cockroach/pkg/roachpb/api.pb.go, Line 10150 in 3ac97e6

which writes into what should be the same … That this would be out of bounds seems to say that the … There could also be some bug with this new code:

cockroach/pkg/roachpb/api.pb.go, Line 1144 in 3ac97e6

I don't know how that works exactly, but it pattern matches and is also newly introduced since we bumped the dependencies a few weeks back. @RaduBerinde do you know more about this? @jordanlewis do we pool ScanRequest (or any part of BatchRequest really) before passing it to KV?
As far as I found in the proto code, marshaling first walks the proto (recursively) to figure out the length of the marshaled data, and then walks it again to do the actual marshaling. The second step also needs to know the size of each structure in the tree. If the structure has a …

I believe the value is only relevant during marshaling. I don't see it causing a problem unless a proto is changed while it is being marshaled (which sounds bad regardless). If the same proto is being marshaled twice in parallel, both threads could perhaps write to it, but they'd be writing the same value.

That being said, you can disable …
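To make the two-pass scheme concrete, here is a minimal hand-written sketch of the pattern (not the actual generated api.pb.go code; `msg`, `sizeCache`, and the helpers are all illustrative). It shows how a mutation between the size pass and the marshal pass produces exactly this kind of index-out-of-range panic:

```go
package main

import "fmt"

// msg stands in for a generated proto struct; sizeCache mimics the
// XXX_sizecache field that memoizes the result of the size pass.
type msg struct {
	key       []byte
	sizeCache int32
}

// size is pass one: compute (and cache) the encoded length,
// here one tag byte, a length varint, and the key bytes.
func (m *msg) size() int {
	n := 1 + sovLen(len(m.key)) + len(m.key)
	m.sizeCache = int32(n)
	return n
}

// marshalTo is pass two: write into a buffer allocated from size().
// If m.key grew between the passes, the indexed writes below run
// past the end of dAtA and panic with "index out of range".
func (m *msg) marshalTo(dAtA []byte) {
	i := 0
	dAtA[i] = 0x0a // field 1, wire type 2 (bytes)
	i++
	i = encodeVarint(dAtA, i, uint64(len(m.key)))
	for _, b := range m.key {
		dAtA[i] = b
		i++
	}
}

func sovLen(x int) int { // varint width of x
	n := 1
	for x >= 1<<7 {
		x >>= 7
		n++
	}
	return n
}

func encodeVarint(dAtA []byte, offset int, v uint64) int {
	for v >= 1<<7 {
		dAtA[offset] = uint8(v&0x7f | 0x80)
		v >>= 7
		offset++
	}
	dAtA[offset] = uint8(v)
	return offset + 1
}

func main() {
	m := &msg{key: []byte("a")}
	buf := make([]byte, m.size())  // pass one: buffer sized for a 1-byte key
	m.key = []byte("a-longer-key") // mutation between the two passes
	defer func() { fmt.Println("recovered:", recover()) }()
	m.marshalTo(buf) // pass two: panics, buffer is too small
}
```

Note that in this sketch, as in the comment above, the cached size itself is benign under parallel marshaling of an unchanged proto: both passes write the same value. The panic needs an actual mutation between (or during) the passes.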
This is the relevant code; I don't think there's anything else using …
There's no object pooling for ScanRequest (or anything with protos, I don't think). I did add a bunch of pooling recently, but it was all for objects higher up in the stack.
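For context on why the question matters: if request protos were pooled, a recycle racing with an in-flight marshal would let the two passes see different contents. A hypothetical sketch of that hazard (again, per the comment above, these objects are not actually pooled):

```go
package main

import "sync"

// scanRequest is a hypothetical stand-in for a pooled request proto.
type scanRequest struct {
	key []byte
}

var reqPool = sync.Pool{
	New: func() interface{} { return new(scanRequest) },
}

func main() {
	var wg sync.WaitGroup

	req := reqPool.Get().(*scanRequest)
	req.key = []byte("scan-start-key")

	wg.Add(1)
	go func() {
		defer wg.Done()
		_ = len(req.key) // stand-in for the Size()/MarshalTo() passes
	}()

	// Bug: returning req to the pool while the goroutine above may
	// still be reading it. The next Get can hand out the same object
	// and mutate it mid-marshal -- the race being asked about.
	reqPool.Put(req)
	req2 := reqPool.Get().(*scanRequest)
	req2.key = []byte("a-different-and-longer-key")

	wg.Wait()
}
```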
Issue #34256 seems to indicate it isn't a pooling issue: the whole stack is in a single goroutine, including initialization of the proto in question from fresh memory.
So one thing I realized this weekend is that I wasn't pointing the workload at all of the nodes. I ran:

…

instead of

…
So the only thing that's marshalled before the out-of-bounds is the start and end key of the RequestHeader. This means that any kind of potential mutation between the size calculation and the marshalling would have to occur on the span keys. Let's see if that's possible. We set the span of a RequestHeader in just one place:

cockroach/pkg/sql/row/kv_batch_fetcher.go, Lines 208 to 218 in 9f084ad
So, we're always setting it to the values in … The spans are written at the initialization of a …:

cockroach/pkg/sql/row/kv_batch_fetcher.go, Lines 179 to 197 in 9f084ad
So at this point we should be confident that a mutation to the container that got passed into the … I looked around for instances of that and couldn't find anything.

But, it's worth noting that there's something suspicious going on with these spans: I did add some object pooling for TableReaders in September (#30676), which attempts to reuse the memory for the …

All this being said, I still don't get what's going on.
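The suspicion above boils down to slice aliasing: if the request header keeps a reference to the caller's span memory rather than a copy, any reuse of that memory mutates the in-flight request. A minimal illustration (the types are stand-ins, not the real roachpb structs):

```go
package main

import "fmt"

// requestHeader stands in for roachpb.RequestHeader: it aliases
// whatever key slice the caller hands it rather than copying.
type requestHeader struct {
	key []byte
}

func main() {
	spans := [][]byte{[]byte("table/1/key")}

	// The fetcher sets the header directly from the caller's spans,
	// so the header and the container share one backing array.
	req := requestHeader{key: spans[0]}

	// If a pooled processor later reuses the spans container...
	copy(spans[0], []byte("table/2/zzz"))

	// ...the in-flight request changes underneath the marshaler.
	fmt.Printf("request key is now %q\n", req.key) // "table/2/zzz"
}
```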
Ugh. Unfortunately I also can't see where the problem with that would be. Can you (or find someone who has time to) wrap this repro in a script and in particular run TPCC itself with a race build (after verifying that it sometimes repros without the race build)? Maybe we'll get lucky. We could also catch the panic from that particular proto and print the actual size of …

My assumption is that the size cache stuff exposes some preexisting condition.

PS: I did look through DistSender and in particular its …
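Catching the panic could look roughly like the sketch below: wrap the marshal in a recover and report the sizes that disagreed. This is only an illustration of the idea; the helper that was eventually added (`withMarshalingDebugging`, mentioned further down) lives in the KV layer and differs in detail.

```go
package main

import "fmt"

// sizer matches the two generated methods we need from a gogoproto message.
type sizer interface {
	Size() int
	Marshal() ([]byte, error)
}

// marshalWithDebug re-panics with the sizes observed before and after
// the failure, so a crash report shows whether the proto changed
// between the size pass and the marshal pass.
func marshalWithDebug(m sizer) (data []byte, err error) {
	before := m.Size()
	defer func() {
		if r := recover(); r != nil {
			panic(fmt.Sprintf("marshal panic: %v (Size before=%d, after=%d)",
				r, before, m.Size()))
		}
	}()
	return m.Marshal()
}

// fakeMsg is a trivial message used only to exercise the wrapper.
type fakeMsg struct{ b []byte }

func (f *fakeMsg) Size() int                { return len(f.b) }
func (f *fakeMsg) Marshal() ([]byte, error) { return append([]byte(nil), f.b...), nil }

func main() {
	out, _ := marshalWithDebug(&fakeMsg{b: []byte("ok")})
	fmt.Printf("%q\n", out)
}
```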
We've now seen this in the wild. #34241
@jordanlewis that's linking back to this issue; what's the real issue number?
Oops! #34674
Yikes. What do we do with this? Seems that at this point we want some more eyes on the problem and a way to get a core dump if it happens.
A core dump would definitely help. I (or someone else) should get on #34680. I'll try and find time to reproduce on Friday.
I'm very confused -- I was looking at this again and can't find …

Not that this would matter for the kind of bug at hand (if the protos change out from under you during a marshal, things are going to hurt), but it seems that we never actually turned on the sizecache, contrary to what the commit message suggested. cc @RaduBerinde
Taking a second look, this isn't actually correct. We initialize the memory right at the beginning of …
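For reference, with the option enabled the generator adds a cache field to each message, along the lines of the sketch below (an approximation of gogoproto output for an invented message, not the real api.pb.go):

```go
package sketch

// ScanRequestLike approximates a gogoproto-generated message with the
// sizecache option on: Size() stores its result in XXX_sizecache and
// the marshal pass reads it back instead of recomputing nested sizes.
type ScanRequestLike struct {
	Key    []byte
	EndKey []byte

	XXX_sizecache int32 `json:"-"`
}
```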
You may be interested to know that I just hit this on a cluster composed of 4-core nodes.
A clear contract around these things would be gold. Some previous discussion was in #30485, and then I had to revert some crap in #30773 because I introduced some mysterious races. I never got back to investigating them and reverting the revert. If you also want to take over #30773, you'd get my peer ack.
This disables a portion of the additional checking added in cockroachdb#35202 because it increases the rate of panics. This case should be re-enabled with the complete resolution of cockroachdb#34241.

Release note: None
35663: kv: disable size check on marshaling to reduce panics r=ajwerner a=ajwerner

This disables a portion of the additional checking added in #35202 because it increases the rate of panics. This case should be re-enabled with the complete resolution of #34241.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
This commit adds support for running `kv` with a `--batch` parameter. It then adds the following new roachtest configurations:

- kv0/enc=false/nodes=3/batch=16
- kv95/enc=false/nodes=3/batch=16
- kv0/enc=false/nodes=3/cpu=96
- kv95/enc=false/nodes=3/cpu=96
- kv50/enc=false/nodes=4/cpu=96/batch=64

The last test is currently skipped because of cockroachdb#34241. I confirmed that it triggers the corresponding assertion on both AWS and GCE.

My request for more m5d.24xlarge quota just succeeded, but I may need to request more quota for n1-highcpu-96 VMs for these to run nightly.

Release note: None
35300: sql: add checks after all referenced columns have been backfilled r=lucy-zhang a=lucy-zhang

Previously, if a column was added with a check constraint that also referenced another column that was public, writes to that public column would erroneously fail (and, in the worst case, result in a panic) while the column being added was not yet public. With this change, the schema changer will now wait to add the check constraint to the table descriptor until after all the columns that were added in the same transaction have been backfilled.

A new optional field has been added to `ConstraintToValidate` for the check constraint itself, so that it can be added to the table descriptor at the correct step in the schema change process. I ended up adding this field to the existing mutation instead of creating a new type of mutation to add the constraint to the descriptor, since it ultimately seemed to me that a mutation that simply adds a schema element in its backfill step would be too inconsistent with what mutations are, especially since all the mutations for a single transaction are basically executed at the same time.

To support NOT VALID in the future, we could add more flags to the protobuf to indicate that either the addition of the constraint or the validation should be skipped, so that they can be executed separately.

Fixes #35258, fixes #35193

Release note: None

35682: engineccl/mvcc: fix time-bound iterator's interaction with moving intents r=nvanbenschoten a=nvanbenschoten

Fixes #34819.

349ff61 introduced a workaround for #28358 into MVCCIncrementalIterator. This workaround created a second (non-time-bound) iterator to verify possibly-phantom MVCCMetadata keys during iteration.

We found in #34819 that it is necessary for correctness that sanityIter be created before the original iter. This is because the provided Reader that both iterators are created from may not be a consistent snapshot, so the two iterators could end up observing different information. The hack around sanityCheckMetadataKey only works properly if all possible discrepancies between the two iterators lead to intents and values falling outside of the timestamp range **from the time-bound iterator's perspective**. This allows us to simply ignore discrepancies that we notice in advance().

This commit makes this change. It also adds a test that failed regularly before the fix under stress and no longer fails after the fix.

Release note: None

35689: roachtest: add large node kv tests and batching kv tests r=nvanbenschoten a=nvanbenschoten

This commit adds support for running `kv` with a `--batch` parameter. It then adds the following new roachtest configurations:

- kv0/enc=false/nodes=3/batch=16
- kv95/enc=false/nodes=3/batch=16
- kv0/enc=false/nodes=3/cpu=96
- kv95/enc=false/nodes=3/cpu=96
- kv50/enc=false/nodes=4/cpu=96/batch=64

The last test is currently skipped because of #34241. I confirmed that it triggers the corresponding assertion on both AWS and GCE.

My request for more m5d.24xlarge quota just succeeded, but I may need to request more quota for n1-highcpu-96 VMs for these to run nightly.

Release note: None

Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Fixes cockroachdb#34241.

This missing clone is what allows the request transaction to enter the response transaction and create the data race that was causing the panic. This commit adds the necessary clone and removes the corresponding assertion.

Release note: None
35688: sql: specify hash or merge join in EXPLAIN r=RaduBerinde a=RaduBerinde

Replace `join` with `hash-join` or `merge-join` in `EXPLAIN` output.

Fixes #35683.

Release note (sql change): EXPLAIN tree now uses `hash-join` or `merge-join` instead of `join`.

35719: roachpb: prevent data race on Transaction r=nvanbenschoten a=nvanbenschoten

Fixes #34241.

This PR starts off with a series of small cleanups related to ownership of `roachpb.Transaction` objects and the use of deep and shallow clones. This makes up the first 5 commits.

Next, the 6th commit removes redundant calls to `Transaction.UpdateObservedTimestamp`, reducing it down to have just a single caller that is easy to reason about.

The 7th commit is what actually fixes the referenced issue. It adds in the proto clone that was missing from `BatchResponse_Header.combine`, whose absence was allowing a `BatchRequest` to reference the same `Transaction` object as a `BatchResponse`. I confirmed a number of times that this stops the assertion from firing, so the commit also re-enables the skipped roachtest and removes the assertion.

~The final two commits are the two that we might consider waiting on and not including in this release, but they're also the most exciting. By making `Transaction.ObservedTimestamps` immutable (which it almost already was), we can prohibit all interior mutability of references within `Transaction`, give it value semantics, and eliminate the distinction between "shallow" and "deep" object cloning. This reduces the cost of a clone to a single straightforward allocation and makes working with the object easier to think about.~ EDIT: these two were removed from this PR.

Past this point, I think the only other change we might want to make here is making a clone of `ba.Txn` in `internalClientAdapter` and then declaring that the `Batch` handler goroutine has exclusive ownership of its `ba.Txn`. This would allow us to avoid a few conservative clones that would no longer be necessary, like [here](https://github.com/cockroachdb/cockroach/blob/57e825a7940495b67e0cc7213a5fabc24e12be0e/pkg/storage/store.go#L2827) and [here](https://github.com/cockroachdb/cockroach/blob/57e825a7940495b67e0cc7213a5fabc24e12be0e/pkg/storage/replica.go#L1309). I did not make this change here.

35736: opt: catch all pgerror.Error in optbuilder r=RaduBerinde a=RaduBerinde

We now catch all `pgerror.Error`s in optbuilder, which means that we don't need to use the `buildError` wrapper with them. The wrapper still exists when external calls (e.g. for name resolution) return a generic error.

The main motivation is that optbuilder calls into the factory, which can panic internally. We will want to switch those panics to assertion errors as well, but they are in different packages. The existing approach would have required a shared, exported wrapper.

Release note: None

Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
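Reduced to a sketch, the fix in that 7th commit has this shape (the types here are simplified stand-ins for `roachpb.Transaction` and `BatchResponse_Header`, and the real `combine` does more than this):

```go
package main

import "fmt"

// txn stands in for roachpb.Transaction; header for BatchResponse_Header.
type txn struct {
	observed []int64
}

func (t *txn) clone() *txn {
	c := *t
	c.observed = append([]int64(nil), t.observed...)
	return &c
}

type header struct {
	txn *txn
}

// combine folds another header into h. Before the fix it aliased
// o.txn directly, so the response shared a Transaction with whoever
// owned o; any later mutation raced with marshaling the response.
func (h *header) combine(o header) {
	if h.txn == nil && o.txn != nil {
		h.txn = o.txn.clone() // the fix: clone instead of aliasing
	}
}

func main() {
	req := header{txn: &txn{observed: []int64{1}}}
	var resp header
	resp.combine(req)
	req.txn.observed = append(req.txn.observed, 2) // mutate the request's txn
	fmt.Println(len(resp.txn.observed))            // still 1: no shared state
}
```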
This debugging function was removed when we fixed cockroachdb#34241 and added back in when cockroachdb#35803 appeared, because it was clear that we hadn't fully fixed the issue. It's been about 2 months since cockroachdb#35762 merged and we haven't seen any issues since, so this can now be removed.

I don't think we meant to keep this in for the 19.1 release. We should backport this commit.

Release note: None
37559: kv: remove withMarshalingDebugging function r=nvanbenschoten a=nvanbenschoten

This debugging function was removed when we fixed #34241 and added back in when #35803 appeared, because it was clear that we hadn't fully fixed the issue. It's been about 2 months since #35762 merged and we haven't seen any issues since, so this can now be removed.

I don't think we meant to keep this in for the 19.1 release. We should backport this commit.

Co-authored-by: Nathan VanBenschoten <[email protected]>
I'm not aware of this causing issues, but we're pretty careful with this in other places (like BatchResponse_Header.combine) to avoid issues like what we saw in cockroachdb#34241, so this seems like a good change.

Release note: None
Describe the problem
Node died while running tpc-c on a six-node cluster with range merging turned off. In the shell I saw:

…
To Reproduce

…
Expected behavior
Completing tpc-c
Additional data / screenshots
Environment:
v2.2.0-alpha.20181217-820-g645c0c9
From log to stderr
https://gist.github.com/awoods187/a44b3505f7921aa152d8c0b488afdccc