
Panic runtime error: index out of range when running 6 node tpc-c without range merging #34241

Closed
awoods187 opened this issue Jan 25, 2019 · 55 comments · Fixed by #35719 or #37559
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.

Comments

@awoods187
Contributor

awoods187 commented Jan 25, 2019

Describe the problem

Node died while running tpc-c on a six-node cluster with range merging turned off. In the shell I saw:

Error: error in payment: EOF
Error:  exit status 1

To Reproduce

export CLUSTER=andy-base
roachprod create $CLUSTER -n 7 --clouds=aws  --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER:1-6 -- 'sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier /dev/nvme1n1 /mnt/data1/; mount | grep /mnt/data1'
roachprod stage $CLUSTER:1-6 cockroach
roachprod stage $CLUSTER:7 workload
roachprod start $CLUSTER:1-6
roachprod sql $CLUSTER:1 -- -e 'SET CLUSTER SETTING kv.range_merge.queue_enabled = false'
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=5000 --db=tpcc"
roachprod run $CLUSTER:7 "./workload run tpcc --ramp=5m --warehouses=3400 --duration=15m --split --scatter {pgurl:1-6}"

Expected behavior
Completing tpc-c

Additional data / screenshots
[screenshot]

Environment:
v2.2.0-alpha.20181217-820-g645c0c9

From the log to stderr:
https://gist.github.com/awoods187/a44b3505f7921aa152d8c0b488afdccc

ERROR: [n5,client=172.31.42.127:50898,user=root,txn=44987ebc] a panic has occurred!
runtime error: index out of range
@awoods187
Contributor Author

I extended this so it'd be up tomorrow: andy-nomerge

@jordanlewis
Member

Isn't this the same as the other one you filed?

@tbg
Member

tbg commented Jan 25, 2019

Not the same, note the different receiver on MarshalTo.

panic(0x2d24ec0, 0x5486fa0)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/cockroachdb/cockroach/pkg/roachpb.(*ScanRequest).MarshalTo(0xc0219b1f40, 0xc032607cc2, 0x15, 0x15, 0x17, 0x2, 0x13)
panic(0x2d24ec0, 0x5486fa0)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/cockroachdb/cockroach/pkg/roachpb.(*ScanRequest).MarshalTo(0xc08bb7ba80, 0xc014e5a284, 0x15, 0x15, 0x17, 0x2, 0x13)

The operation that's carried out here is

cockroach/pkg/roachpb/api.pb.go

Lines 14522 to 14530 in 3ac97e6

func (m *BatchRequest) Marshal() (dAtA []byte, err error) {
	size := m.Size()
	dAtA = make([]byte, size)
	n, err := m.MarshalTo(dAtA)
	if err != nil {
		return nil, err
	}
	return dAtA[:n], nil
}

which in particular means that dAtA is freshly allocated memory.

The panic then happens on this line

dAtA[i] = 0x20

which writes into what should be the same dAtA slice.

That this would be out of bounds seems to say that the m.Size() in the first snippet is either wrong or goes stale as we actually write the data. Something like this would be expected if we put stuff in the ScanRequest that comes from some buffer pool, and we hand the request to gRPC while still mutating the scan request.
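To make that failure mode concrete, here is a minimal standalone Go sketch (not CockroachDB code; toyRequest and its methods are invented for illustration) showing how a message that grows between Size() and MarshalTo() blows up exactly like the stack above:

package main

// toyRequest stands in for a generated proto message such as ScanRequest.
type toyRequest struct {
	Key []byte
}

// Size mirrors the generated Size(): it reports the encoded length right now.
func (m *toyRequest) Size() int { return len(m.Key) }

// MarshalTo mirrors the generated MarshalTo(): it indexes into the
// caller-provided buffer, trusting that it was sized by Size().
func (m *toyRequest) MarshalTo(dAtA []byte) int {
	i := 0
	for _, b := range m.Key {
		dAtA[i] = b // index out of range if the buffer was sized for a shorter Key
		i++
	}
	return i
}

func main() {
	m := &toyRequest{Key: []byte("a")}
	dAtA := make([]byte, m.Size())  // freshly allocated, length 1
	m.Key = append(m.Key, 'b', 'c') // the "concurrent" mutation, simulated inline
	_ = m.MarshalTo(dAtA)           // panic: runtime error: index out of range
}

The panic surfaces deep inside MarshalTo, far away from whatever mutated the message, which is why the stack trace alone doesn't identify the culprit.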

There could also be some bug with this new XXX_sizecache thing:

XXX_sizecache int32 `json:"-"`

I don't know how that works exactly, but it pattern-matches, and it was also newly introduced when we bumped the dependencies a few weeks back. @RaduBerinde do you know more about this?

@jordanlewis do we pool ScanRequest (or any part of BatchRequest really) before passing it to KV?

@RaduBerinde
Member

As far as I found in the proto code, when marshaling it first goes through the proto (recursively) to figure out the length of the marshaled data, and then goes through it again to do the actual marshaling. The second step also needs to know the size of each structure in the tree. If the structure has a XXX_sizecache, the size is stored in there in the first step and retrieved in the second step instead of being recalculated.

I believe the value is only relevant during marshaling. I don't see it causing a problem unless a proto is changed while it is being marshaled (which sounds bad regardless). If the same proto is being marshaled twice in parallel, both threads could write to it perhaps but they'd be writing the same value.

That being said, you can disable XXX_sizecache on a per-proto basis (we could also do it for all our protos if we choose to). Supposedly, removing it comes with a small performance regression during marshaling.
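For reference, the opt-out looks roughly like this. This is a sketch from memory of the gogoproto options, not a verified snippet of our .proto files, so double-check the exact option names in gogo.proto before relying on it:

syntax = "proto3";
package example;

import "gogoproto/gogo.proto";

// Disable the generated XXX_sizecache field for every message in this file...
option (gogoproto.goproto_sizecache_all) = false;

message ScanRequestLike {
  // ...or disable it for just this message instead:
  option (gogoproto.goproto_sizecache) = false;
  bytes key = 1;
}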

@RaduBerinde
Member

This is the relevant code; I don't think there's anything else using XXX_sizecache: https://github.com/cockroachdb/vendored/blob/master/github.com/gogo/protobuf/proto/table_marshal.go#L223

@jordanlewis
Member

There's no object pooling for ScanRequest (or anything with protos, I don't think). I did add a bunch of pooling recently but it was all objects higher up in the stack.

@jordanlewis
Member

Issue #34256 seems to indicate it isn't a pooling issue - the whole stack is in a single goroutine, including initialization from fresh memory of the proto in question.

@awoods187
Contributor Author

So one thing I realized this weekend is that I wasn't pointing the workload at all of the nodes. I ran:

roachprod run $CLUSTER:7 "./workload run tpcc --ramp=5m --warehouses=3400 --duration=15m --split --scatter {pgurl:1-3}"

instead of

roachprod run $CLUSTER:7 "./workload run tpcc --ramp=5m --warehouses=3400 --duration=15m --split --scatter {pgurl:1-6}"

@jordanlewis
Member

So the only thing that's marshalled before the out-of-bounds write is the start and end key of the RequestHeader. This means that any potential mutation between the size calculation and the marshalling would have to occur on the span keys. Let's see if that's possible.

We set the span of a RequestHeader in just one place:

	for i := range f.spans {
		scans[i].ScanFormat = roachpb.BATCH_RESPONSE
		scans[i].SetSpan(f.spans[i])
		ba.Requests[i].MustSetInner(&scans[i])
	}
} else {
	scans := make([]roachpb.ScanRequest, len(f.spans))
	for i := range f.spans {
		scans[i].ScanFormat = roachpb.BATCH_RESPONSE
		scans[i].SetSpan(f.spans[i])
		ba.Requests[i].MustSetInner(&scans[i])
So, we're always setting it to the values in txnKVFetcher.spans. Where do those come from? Two places: either from the resumeSpan of a ScanResponse, which seems like it would never get mutated since that's memory returned from gRPC itself, or from the spans passed in by the table reader. Let's examine the latter case.

The spans are written at the initialization of a txnKVFetcher, by copying them into a fresh container:

// Make a copy of the spans because we update them.
copySpans := make(roachpb.Spans, len(spans))
for i := range spans {
	if reverse {
		// Reverse scans receive the spans in decreasing order.
		copySpans[len(spans)-i-1] = spans[i]
	} else {
		copySpans[i] = spans[i]
	}
}
return txnKVFetcher{
	txn:             txn,
	spans:           copySpans,
	reverse:         reverse,
	useBatchLimit:   useBatchLimit,
	firstBatchLimit: firstBatchLimit,
	returnRangeInfo: returnRangeInfo,
}, nil

So at this point we should be confident that a mutation to the container that got passed into the txnKVFetcher shouldn't matter, right? Since everything's passed by value. So, the only case that could potentially get us is if somebody mutates the underlying span's Key/EndKey bytes.

I looked around for instances of that and couldn't find anything.
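To spell out the one case that could still bite us: copying roachpb.Span values copies slice headers, not the key bytes they point at. Here's a self-contained sketch with stand-in types (not the real roachpb package):

package main

import "fmt"

// Stand-ins for roachpb.Key and roachpb.Span.
type Key []byte
type Span struct{ Key, EndKey Key }

func main() {
	pooled := []Span{{Key: Key("a"), EndKey: Key("b")}}

	// The "shallow copy" the fetcher makes: a new slice holding the same Span values.
	copySpans := make([]Span, len(pooled))
	copy(copySpans, pooled)

	// Reusing the pooled slice itself is harmless: the copy keeps the old values.
	pooled[0] = Span{Key: Key("x"), EndKey: Key("y")}
	fmt.Println(string(copySpans[0].Key)) // still "a"

	// But mutating the original key *bytes* in place is not harmless, because
	// the copied Span still points at the same backing array.
	pooled2 := []Span{{Key: Key("a")}}
	copySpans2 := make([]Span, len(pooled2))
	copy(copySpans2, pooled2)
	pooled2[0].Key[0] = 'z'
	fmt.Println(string(copySpans2[0].Key)) // "z": the copy sees the mutation
}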

But it's worth noting that there's something suspicious going on with these spans - I did add some object pooling for TableReaders in September (#30676) which attempts to reuse the memory for the roachpb.Spans slice. That being said, I still don't see where the problem could be, as the sole consumer of this slice makes a (shallow) copy before using it further.

All this being said, I still don't get what's going on.

@tbg
Member

tbg commented Jan 28, 2019

Ugh. Unfortunately I also can't see where the problem with that would be. Can you (or someone who has time) wrap this repro in a script and in particular run TPCC itself with a race build (after verifying that it sometimes reproduces without the race build)? Maybe we'll get lucky.

We could also catch the panic from that particular proto and print the actual size of dAtA vs. what the proto's new Size() is, but the result would necessarily have to be that the new size is larger than len(dAtA).

My assumption is that the size cache stuff exposes some preexisting condition.

PS I did look through DistSender and in particular its truncate method, but that takes an appropriate shallow copy of all requests it truncates.
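A rough sketch of the instrumentation suggested above: recover around this one marshaling path and report the provisioned buffer size against a fresh Size(). The helper name and shape are invented here, not an existing CockroachDB function; it only assumes the Size()/MarshalTo() methods that gogo-generated messages provide:

package kvdebug

import "fmt"

// gogoMessage is satisfied by gogo/protobuf-generated messages.
type gogoMessage interface {
	Size() int
	MarshalTo(dAtA []byte) (int, error)
}

// marshalWithDebugging is a hypothetical wrapper that, when the out-of-range
// panic fires, reports how many bytes were provisioned versus what Size()
// returns now. If the message changed under us, the second number should be
// larger than the first.
func marshalWithDebugging(m gogoMessage) (data []byte, err error) {
	size := m.Size()
	data = make([]byte, size)
	defer func() {
		if r := recover(); r != nil {
			data = nil
			err = fmt.Errorf("marshal panic: %v; provisioned %d bytes, Size() is now %d",
				r, size, m.Size())
		}
	}()
	n, err := m.MarshalTo(data)
	if err != nil {
		return nil, err
	}
	return data[:n], nil
}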

@jordanlewis
Member

We've now seen this in the wild. #34241

@tbg
Member

tbg commented Feb 6, 2019

@jordanlewis that's linking back to this issue, what's the real issue number?

@jordanlewis
Member

Oops! #34674

@tbg
Member

tbg commented Feb 6, 2019

Yikes. What do we do with this? Seems that at this point we want some more eyes on the problem and a way to get a core dump if it happens.

@petermattis
Collaborator

A core dump would definitely help. I (or someone else) should get on #34680. I'll try and find time to reproduce on Friday.

@jordanlewis jordanlewis added S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Feb 8, 2019
@tbg
Member

tbg commented Feb 26, 2019

I'm very confused -- I was looking at this again and can't find XXX_sizecache being used at all on current master? I was so sure I was missing something that I removed the XXX_sizecache field from `BatchRequest`, but everything just works, so there definitely isn't any generated code hooked up to it. I did the same in db4a8a1 and guess what, it wasn't ever hooked up from the looks of it.

Not that this would matter for the kind of bug at hand (if the protos change out from under you during a marshal, things are going to hurt), but it seems that we never actually turned on the sizecache, contrary to what the commit message suggested. cc @RaduBerinde

@tbg
Member

tbg commented Feb 26, 2019

> So the only thing that's marshalled before the out-of-bounds write is the start and end key of the RequestHeader.

Taking a second look, this isn't actually correct. We initialize the memory right at the beginning of BatchRequest.Marshal and marshal everything into it. It crashes eventually on some ScanRequest, but that doesn't say at all that it's the first one, just that we eventually ran out of buffer. So all we know (well, suspect) is that some memory used by the BatchRequest changes while it's being marshaled (and that it changes in a way that would require a larger buffer than was provisioned).
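With toy numbers, that observation looks like this (again a sketch, not real CockroachDB code): the request that panics is not necessarily the request that was mutated.

package main

func main() {
	// A batch with two requests whose encoded sizes were 3 and 2 when the
	// 5-byte buffer was allocated.
	buf := make([]byte, 3+2)
	first := []byte("abcd") // grew from 3 to 4 bytes after the size was taken
	second := []byte("ef")  // innocent, unchanged

	i := 0
	for _, b := range first {
		buf[i] = b // still fits: the mutated request writes cleanly
		i++
	}
	for _, b := range second {
		buf[i] = b // panics here (i == 5), while writing the unchanged request
		i++
	}
}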

@tbg tbg assigned nvanbenschoten and unassigned jordanlewis and tbg Mar 12, 2019
@ajwerner
Contributor

You may be interested to know that I just hit this on a cluster composed of 4-core nodes.

@andreimatei
Contributor

> Either way, want me to take over this issue?

That'd be great. Happy to review.

A clear contract around these things would be gold. Some previous discussion was in #30485, and then I had to revert some crap in #30773 because I introduced some mysterious races. I never got back to investigating them and reverting the revert. If you also want to take over #30773, you'd get my peer ack.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Mar 12, 2019
This disables a portion of the additional checking added in cockroachdb#35202 because it
increases the rate of panics. This case should be re-enabled with the complete
resolution of cockroachdb#34241.

Release note: None
craig bot pushed a commit that referenced this issue Mar 12, 2019
35663: kv: disable size check on marshaling to reduce panics r=ajwerner a=ajwerner

This disables a portion of the additional checking added in #35202 because it
increases the rate of panics. This case should be re-enabled with the complete
resolution of #34241.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 13, 2019
This commit adds support for running `kv` with a `--batch` parameter. It
then adds the following new roachtest configurations:
- kv0/enc=false/nodes=3/batch=16
- kv95/enc=false/nodes=3/batch=16
- kv0/enc=false/nodes=3/cpu=96
- kv95/enc=false/nodes=3/cpu=96
- kv50/enc=false/nodes=4/cpu=96/batch=64

The last test is currently skipped because of cockroachdb#34241. I confirmed that
it triggers the corresponding assertion on both AWS and GCE.

My request for more m5d.24xlarge quota just succeeded, but I may need to
request more quota for n1-highcpu-96 VMs for these to run nightly.

Release note: None
craig bot pushed a commit that referenced this issue Mar 13, 2019
35300: sql: add checks after all referenced columns have been backfilled r=lucy-zhang a=lucy-zhang

Previously, if a column was added with a check constraint that also referenced
another column that was public, writes to that public column would erroneously
fail (and, in the worst case, result in a panic) while the column being added
was not yet public. With this change, the schema changer will now wait to add
the check constraint to the table descriptor until after all the columns that
were added in the same transaction have been backfilled.

A new optional field has been added to `ConstraintToValidate` for the check
constraint itself, so that it can be added to the table descriptor at the
correct step in the schema change process.

I ended up adding this field to the existing mutation instead of creating a new
type of mutation to add the constraint to the descriptor, since it ultimately
seemed to me that a mutation that simply adds a schema element in its backfill
step would be too inconsistent with what mutations are, especially since all
the mutations for a single transaction are basically executed at the same time.
To support NOT VALID in the future, we could add more flags to the protobuf to
indicate that either the addition of the constraint or the validation should be
skipped, so that they can be executed separately.

Fixes #35258, fixes #35193

Release note: None

35682: engineccl/mvcc: fix time-bound iterator's interaction with moving intents r=nvanbenschoten a=nvanbenschoten

Fixes #34819.

349ff61 introduced a workaround for #28358 into MVCCIncrementalIterator.
This workaround created a second (non-time-bound) iterator to verify
possibly-phantom MVCCMetadata keys during iteration.

We found in #34819 that it is necessary for correctness that sanityIter
be created before the original iter. This is because the provided Reader
that both iterators are created from may not be a consistent snapshot, so
the two iterators could end up observing different information. The hack
around sanityCheckMetadataKey only works properly if all possible
discrepancies between the two iterators lead to intents and values falling
outside of the timestamp range **from the time-bound iterator's perspective**.
This allows us to simply ignore discrepancies that we notice in advance().

This commit makes this change. It also adds a test that failed regularly
before the fix under stress and no longer fails after the fix.

Release note: None

35689: roachtest: add large node kv tests and batching kv tests r=nvanbenschoten a=nvanbenschoten

This commit adds support for running `kv` with a `--batch` parameter. It
then adds the following new roachtest configurations:
- kv0/enc=false/nodes=3/batch=16
- kv95/enc=false/nodes=3/batch=16
- kv0/enc=false/nodes=3/cpu=96
- kv95/enc=false/nodes=3/cpu=96
- kv50/enc=false/nodes=4/cpu=96/batch=64

The last test is currently skipped because of #34241. I confirmed that
it triggers the corresponding assertion on both AWS and GCE.

My request for more m5d.24xlarge quota just succeeded, but I may need to
request more quota for n1-highcpu-96 VMs for these to run nightly.

Release note: None

Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 14, 2019
Fixes cockroachdb#34241.

This missing clone is what allows the request transaction to enter the
response transaction and create the data race that was causing the panic.

This commit adds the necessary clone and removes the corresponding
assertion.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 14, 2019
Fixes cockroachdb#34241.

This missing clone is what allows the request transaction to enter the
response transaction and create the data race that was causing the panic.

This commit adds the necessary clone and removes the corresponding
assertion.

Release note: None
craig bot pushed a commit that referenced this issue Mar 14, 2019
35688: sql: specify hash or merge join in EXPLAIN r=RaduBerinde a=RaduBerinde

Replace `join` with `hash-join` or `merge-join` in `EXPLAIN` output.

Fixes #35683.

Release note (sql change): EXPLAIN tree now uses `hash-join` or
`merge-join` instead of `join`.

35719: roachpb: prevent data race on Transaction r=nvanbenschoten a=nvanbenschoten

Fixes #34241.

This PR starts off with a series of small cleanups related to ownership of `roachpb.Transaction` objects and the use of deep and shallow clones. This makes up the first 5 commits.

Next, the 6th commit removes redundant calls to `Transaction.UpdateObservedTimestamp`, reducing it down to have just a single caller that is easy to reason about.

The 7th commit is what actually fixes the referenced issue. It adds the proto clone that was missing from `BatchResponse_Header.combine`, whose absence allowed a `BatchRequest` to reference the same `Transaction` object as a `BatchResponse`. I confirmed a number of times that this stops the assertion from firing, so the commit also re-enables the skipped roachtest and removes the assertion.

~The final two commits are the two that we might consider waiting on and not including in this release, but they're also the most exciting. By making `Transaction.ObservedTimestamps` immutable (which it almost already was), we can prohibit all interior mutability of references within `Transaction`, give it value semantics, and eliminate the distinction between "shallow" and "deep" object cloning. This reduces the cost of a clone to a single straightforward allocation and makes working with the object easier to think about.~

EDIT: these two were removed from this PR.

Past this point, I think the only other change we might want to make here is making a clone of `ba.Txn` in `internalClientAdapter` and then declaring that the `Batch` handler goroutine has exclusive ownership over its `ba.Txn`. This would allow us to avoid a few conservative clones that would no longer be necessary, like [here](https://github.com/cockroachdb/cockroach/blob/57e825a7940495b67e0cc7213a5fabc24e12be0e/pkg/storage/store.go#L2827) and [here](https://github.com/cockroachdb/cockroach/blob/57e825a7940495b67e0cc7213a5fabc24e12be0e/pkg/storage/replica.go#L1309). I did not make this change here.
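For intuition, here's a minimal sketch of the sharing bug and the "clone before you share" fix described above, using stand-in types rather than the real roachpb ones:

package main

import "fmt"

// Stand-ins for roachpb.Transaction and the request/response headers.
type Transaction struct{ Timestamp int64 }

type RequestHeader struct{ Txn *Transaction }
type ResponseHeader struct{ Txn *Transaction }

// combineBuggy mimics the missing-clone bug: the response ends up pointing at
// the same Transaction the in-flight request still references, so later
// updates through the response race with marshaling the request.
func combineBuggy(req *RequestHeader, resp *ResponseHeader) {
	resp.Txn = req.Txn
}

// combineFixed mimics the fix: clone the proto so request and response own
// independent Transaction objects.
func combineFixed(req *RequestHeader, resp *ResponseHeader) {
	clone := *req.Txn // a real proto clone would deep-copy nested slices too
	resp.Txn = &clone
}

func main() {
	req := &RequestHeader{Txn: &Transaction{Timestamp: 1}}
	resp := &ResponseHeader{}

	combineBuggy(req, resp)
	resp.Txn.Timestamp = 2         // mutating the response...
	fmt.Println(req.Txn.Timestamp) // ...also changed the request: prints 2

	req.Txn.Timestamp = 1
	combineFixed(req, resp)
	resp.Txn.Timestamp = 2
	fmt.Println(req.Txn.Timestamp) // request unaffected: prints 1
}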

35736: opt: catch all pgerror.Error in optbuilder r=RaduBerinde a=RaduBerinde

We now catch all `pgerror.Error`s in optbuilder, which means that we
don't need to use the `buildError` wrapper with them. The wrapper
still exists when external calls (e.g. for name resolution) return a
generic error.

The main motivation is that optbuilder calls into the factory which
can panic internally. We will want to switch those panics to assertion
errors as well, but they are in different packages. The existing
approach would have required a shared, exported wrapper.

Release note: None

Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #35719 Mar 14, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 13, 2019
Fixes cockroachdb#34241.

This missing clone is what allows the request transaction to enter the
response transaction and create the data race that was causing the panic.

This commit adds the necessary clone and removes the corresponding
assertion.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 17, 2019
This debugging function was removed when we fixed cockroachdb#34241 and added back
in when cockroachdb#35803 appeared because it was clear that we hadn't fully fixed
the issue. It's been about 2 months since cockroachdb#35762 merged and we haven't
seen any issues since, so this can now be removed.

I don't think we meant to keep this in for the 19.1 release. We should
backport this commit.

Release note: None
craig bot pushed a commit that referenced this issue May 17, 2019
37559: kv: remove withMarshalingDebugging function r=nvanbenschoten a=nvanbenschoten

This debugging function was removed when we fixed #34241 and added back
in when #35803 appeared because it was clear that we hadn't fully fixed
the issue. It's been about 2 months since #35762 merged and we haven't
seen any issues since, so this can now be removed.

I don't think we meant to keep this in for the 19.1 release. We should
backport this commit.

Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 8, 2019
I'm not aware of this causing issues, but we're pretty careful with this
in other places (like BatchResponse_Header.combine) to avoid issues like
what we saw in cockroachdb#34241, so this seems like a good change.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 25, 2019
I'm not aware of this causing issues, but we're pretty careful with this
in other places (like BatchResponse_Header.combine) to avoid issues like
what we saw in cockroachdb#34241, so this seems like a good change.

Release note: None