
sentry: stopper.go:182: runtime.errorString: runtime error: invalid memory address or nil pointer dereference #35557

Closed
cockroach-teamcity opened this issue Mar 8, 2019 · 1 comment · Fixed by #35630
Labels: C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report.

@cockroach-teamcity (Member) commented:

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://sentry.io/organizations/cockroach-labs/issues/923688019/?project=164528&referrer=webhooks_plugin

Panic message:

stopper.go:182: runtime.errorString: runtime error: invalid memory address or nil pointer dereference

Stacktrace (expand for inline code snippets):

/usr/local/go/src/runtime/asm_amd64.s#L572-L574 in runtime.call32
/usr/local/go/src/runtime/panic.go#L501-L503 in runtime.gopanic

cockroach/pkg/storage/store.go

Lines 3046 to 3048 in 0c87b11

// again due to undefined state.
panic(r)
}
in pkg/storage.(*Store).Send.func1
/usr/local/go/src/runtime/asm_amd64.s#L573-L575 in runtime.call64
/usr/local/go/src/runtime/panic.go#L501-L503 in runtime.gopanic
/usr/local/go/src/runtime/panic.go#L62-L64 in runtime.panicmem
/usr/local/go/src/runtime/signal_unix.go#L387-L389 in runtime.sigpanic

cockroach/pkg/storage/store.go

Lines 3116 to 3118 in 0c87b11

// wasn't read-only.
if pErr == nil && ba.Txn != nil && br.Txn.Status == roachpb.PENDING && !ba.IsReadOnly() {
cleanupAfterWriteIntentError(nil, &br.Txn.TxnMeta)
in pkg/storage.(*Store).Send.func2
/usr/local/go/src/runtime/asm_amd64.s#L572-L574 in runtime.call32
/usr/local/go/src/runtime/panic.go#L501-L503 in runtime.gopanic
/usr/local/go/src/runtime/panic.go#L34-L36 in runtime.panicslice
https://github.com/cockroachdb/cockroach/blob/0c87b11cb99ba5c677c95ded55dcba385928474e/vendor/github.com/andy-kimball/arenaskl/arena.go#L83-L85 in vendor/github.com/andy-kimball/arenaskl.(*Skiplist).allocVal
https://github.com/cockroachdb/cockroach/blob/0c87b11cb99ba5c677c95ded55dcba385928474e/vendor/github.com/andy-kimball/arenaskl/skl.go#L198-L200 in vendor/github.com/andy-kimball/arenaskl.(*Skiplist).allocVal
https://github.com/cockroachdb/cockroach/blob/0c87b11cb99ba5c677c95ded55dcba385928474e/vendor/github.com/andy-kimball/arenaskl/iterator.go#L237-L239 in vendor/github.com/andy-kimball/arenaskl.(*Iterator).Set
err := it.Set(b, newMeta)
switch err {
in pkg/storage/tscache.(*sklPage).ratchetValueSet
// the node is already initialized then this is a no-op.
ratchetErr := p.ratchetValueSet(it, onlyIfUninitialized,
prevGapVal, prevGapVal, false /* setInit */)
in pkg/storage/tscache.(*sklPage).scanTo
// Iterate forwards to key, remembering the last gap value.
prevGapVal, _ := p.scanTo(it, key, 0, cacheValue{})
return prevGapVal
in pkg/storage/tscache.(*sklPage).incomingGapVal
// first node >= from.
prevGapVal := p.incomingGapVal(it, from)
in pkg/storage/tscache.(*sklPage).maxInRange
return p.maxInRange(&it, from, to, opt)
}
in pkg/storage/tscache.(*sklPage).lookupTimestampRange
val2 := p.lookupTimestampRange(from, to, opt)
val, _ = ratchetValue(val, val2)
in pkg/storage/tscache.(*intervalSkl).LookupTimestampRange
func (s *intervalSkl) LookupTimestamp(key []byte) cacheValue {
return s.LookupTimestampRange(nil, key, 0)
}
in pkg/storage/tscache.(*intervalSkl).LookupTimestamp
if len(end) == 0 {
val = skl.LookupTimestamp(nonNil(start))
} else {
in pkg/storage/tscache.(*sklImpl).getMax
func (tc *sklImpl) GetMaxRead(start, end roachpb.Key) (hlc.Timestamp, uuid.UUID) {
return tc.getMax(start, end, true /* readCache */)
}
in pkg/storage/tscache.(*sklImpl).GetMaxRead
// Forward the timestamp if there's been a more recent read (by someone else).
rTS, rTxnID := r.store.tsCache.GetMaxRead(header.Key, header.EndKey)
if rTS.Forward(minReadTS) {
in pkg/storage.(*Replica).applyTimestampCache
// timestamp and possible write-too-old bool.
if bumped, pErr := r.applyTimestampCache(ctx, &ba, minTS); pErr != nil {
return nil, pErr, proposalNoRetry
in pkg/storage.(*Replica).tryExecuteWriteBatch
for count := 0; ; count++ {
br, pErr, retry := r.tryExecuteWriteBatch(ctx, ba)
switch retry {
in pkg/storage.(*Replica).executeWriteBatch
log.Event(ctx, "read-write path")
br, pErr = r.executeWriteBatch(ctx, ba)
} else if isReadOnly {
in pkg/storage.(*Replica).sendWithRangeID
) (*roachpb.BatchResponse, *roachpb.Error) {
return r.sendWithRangeID(ctx, r.RangeID, ba)
}
in pkg/storage.(*Replica).Send

cockroach/pkg/storage/store.go

Lines 3167 to 3169 in 0c87b11

}
br, pErr = repl.Send(ctx, ba)
if pErr == nil {
in pkg/storage.(*Store).Send
br, pErr := store.Send(ctx, ba)
if br != nil && br.Error != nil {
in pkg/storage.(*Stores).Send
var pErr *roachpb.Error
br, pErr = n.stores.Send(ctx, *args)
if pErr != nil {
in pkg/server.(*Node).batchInternal.func1
return f(ctx)
}
in pkg/util/stop.(*Stopper).RunTaskWithErr
if err := n.stopper.RunTaskWithErr(ctx, "node.Node: batch", func(ctx context.Context) error {
var finishSpan func(*roachpb.BatchResponse)
in pkg/server.(*Node).batchInternal

cockroach/pkg/server/node.go

Lines 1014 to 1016 in 0c87b11

br, err := n.batchInternal(ctx, args)
in pkg/server.(*Node).Batch
handler := func(ctx context.Context, req interface{}) (interface{}, error) {
return srv.(InternalServer).Batch(ctx, req.(*BatchRequest))
}
in pkg/roachpb._Internal_Batch_Handler.func1
https://github.com/cockroachdb/cockroach/blob/0c87b11cb99ba5c677c95ded55dcba385928474e/vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc/server.go#L47-L49 in vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc.OpenTracingServerInterceptor.func1
if prevUnaryInterceptor != nil {
return prevUnaryInterceptor(ctx, req, info, handler)
}
in pkg/rpc.NewServerWithInterceptor.func1
if prevUnaryInterceptor != nil {
return prevUnaryInterceptor(ctx, req, info, handler)
}
in pkg/rpc.NewServerWithInterceptor.func3
}
return interceptor(ctx, in, info, handler)
}
in pkg/roachpb._Internal_Batch_Handler
https://github.com/cockroachdb/cockroach/blob/0c87b11cb99ba5c677c95ded55dcba385928474e/vendor/google.golang.org/grpc/server.go#L1010-L1012 in vendor/google.golang.org/grpc.(*Server).processUnaryRPC
https://github.com/cockroachdb/cockroach/blob/0c87b11cb99ba5c677c95ded55dcba385928474e/vendor/google.golang.org/grpc/server.go#L1248-L1250 in vendor/google.golang.org/grpc.(*Server).handleStream
https://github.com/cockroachdb/cockroach/blob/0c87b11cb99ba5c677c95ded55dcba385928474e/vendor/google.golang.org/grpc/server.go#L679-L681 in vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1

/usr/local/go/src/runtime/asm_amd64.s in runtime.call32 at line 573
/usr/local/go/src/runtime/panic.go in runtime.gopanic at line 502
pkg/storage/store.go in pkg/storage.(*Store).Send.func1 at line 3047
/usr/local/go/src/runtime/asm_amd64.s in runtime.call64 at line 574
/usr/local/go/src/runtime/panic.go in runtime.gopanic at line 502
/usr/local/go/src/runtime/panic.go in runtime.panicmem at line 63
/usr/local/go/src/runtime/signal_unix.go in runtime.sigpanic at line 388
pkg/storage/store.go in pkg/storage.(*Store).Send.func2 at line 3117
/usr/local/go/src/runtime/asm_amd64.s in runtime.call32 at line 573
/usr/local/go/src/runtime/panic.go in runtime.gopanic at line 502
/usr/local/go/src/runtime/panic.go in runtime.panicslice at line 35
vendor/github.com/andy-kimball/arenaskl/arena.go in vendor/github.com/andy-kimball/arenaskl.(*Skiplist).allocVal at line 84
vendor/github.com/andy-kimball/arenaskl/skl.go in vendor/github.com/andy-kimball/arenaskl.(*Skiplist).allocVal at line 199
vendor/github.com/andy-kimball/arenaskl/iterator.go in vendor/github.com/andy-kimball/arenaskl.(*Iterator).Set at line 238
pkg/storage/tscache/interval_skl.go in pkg/storage/tscache.(*sklPage).ratchetValueSet at line 870
pkg/storage/tscache/interval_skl.go in pkg/storage/tscache.(*sklPage).scanTo at line 1022
pkg/storage/tscache/interval_skl.go in pkg/storage/tscache.(*sklPage).incomingGapVal at line 982
pkg/storage/tscache/interval_skl.go in pkg/storage/tscache.(*sklPage).maxInRange at line 932
pkg/storage/tscache/interval_skl.go in pkg/storage/tscache.(*sklPage).lookupTimestampRange at line 548
pkg/storage/tscache/interval_skl.go in pkg/storage/tscache.(*intervalSkl).LookupTimestampRange at line 488
pkg/storage/tscache/interval_skl.go in pkg/storage/tscache.(*intervalSkl).LookupTimestamp at line 446
pkg/storage/tscache/skl_impl.go in pkg/storage/tscache.(*sklImpl).getMax at line 116
pkg/storage/tscache/skl_impl.go in pkg/storage/tscache.(*sklImpl).GetMaxRead at line 103
pkg/storage/replica.go in pkg/storage.(*Replica).applyTimestampCache at line 2596
pkg/storage/replica.go in pkg/storage.(*Replica).tryExecuteWriteBatch at line 3230
pkg/storage/replica.go in pkg/storage.(*Replica).executeWriteBatch at line 3097
pkg/storage/replica.go in pkg/storage.(*Replica).sendWithRangeID at line 2026
pkg/storage/replica.go in pkg/storage.(*Replica).Send at line 1974
pkg/storage/store.go in pkg/storage.(*Store).Send at line 3168
pkg/storage/stores.go in pkg/storage.(*Stores).Send at line 185
pkg/server/node.go in pkg/server.(*Node).batchInternal.func1 at line 987
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunTaskWithErr at line 303
pkg/server/node.go in pkg/server.(*Node).batchInternal at line 974
pkg/server/node.go in pkg/server.(*Node).Batch at line 1015
pkg/roachpb/api.pb.go in pkg/roachpb._Internal_Batch_Handler.func1 at line 6633
vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc/server.go in vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc.OpenTracingServerInterceptor.func1 at line 48
pkg/rpc/context.go in pkg/rpc.NewServerWithInterceptor.func1 at line 197
pkg/rpc/context.go in pkg/rpc.NewServerWithInterceptor.func3 at line 227
pkg/roachpb/api.pb.go in pkg/roachpb._Internal_Batch_Handler at line 6635
vendor/google.golang.org/grpc/server.go in vendor/google.golang.org/grpc.(*Server).processUnaryRPC at line 1011
vendor/google.golang.org/grpc/server.go in vendor/google.golang.org/grpc.(*Server).handleStream at line 1249
vendor/google.golang.org/grpc/server.go in vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1 at line 680
Tag: Value
Cockroach Release: v2.1.3
Cockroach SHA: 0c87b11
Platform: linux amd64
Distribution: CCL
Environment: v2.1.3
Command: server
Go Version: go1.10.3
# of CPUs: 4
# of Goroutines: 237
@cockroach-teamcity cockroach-teamcity added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. labels Mar 8, 2019
@nvanbenschoten nvanbenschoten self-assigned this Mar 9, 2019
nvanbenschoten added a commit to nvanbenschoten/arenaskl that referenced this issue Mar 10, 2019
This change protects against large allocations causing Arena's
internal size accounting to overflow and produce incorrect results.
This overflow could allow for invalid offsets being returned from
`Arena.Alloc`, leading to slice bounds and index out of range panics
when passed to `Arena.GetBytes` or `Arena.GetPointer`.

Specifically, if `Arena.n` overflowed in `Arena.Alloc`, which was
possible because it was a uint32 and we allow up to MaxUint32 size
allocations, then `Arena.Alloc` could bypass the length check against
`Arena.buf`. The "padded" size would then be subtracted from the offset,
allowing the offset to underflow back around to an offset that was out
of `Arena.buf`'s bounds.

I believe this is the underlying cause of the following two crashes:
- cockroachdb/cockroach#31624
- cockroachdb/cockroach#35557

The reason for this is subtle and has to do with failed allocations
slowly building up and eventually overflowing `Arena.n`. I'll explain
on the PR that vendors this fix.

We now protect against this overflow in two ways. We perform an initial
size check before performing the atomic addition in `Arena.Alloc` to
prevent failed allocations from "building up" to an overflow. We also
use 64-bit arithmetic so that it would take 2^32 concurrent allocations
with the maximum allocation size (math.MaxUint32) to overflow the
accounting. An alternative would be to use a CAS loop to strictly
serialize all additions to `Arena.n`, but that would limit concurrency.
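The failure mode described in the commit message can be illustrated with a minimal sketch. The `arena` type, field names, and constants below are assumptions for illustration, not the real arenaskl code; the sketch only shows the shape of the fix: a size check before the atomic add (so failed allocations cannot inflate the counter) plus 64-bit accounting (so the add itself cannot wrap the way a uint32 could).

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Hypothetical minimal arena illustrating the overflow scenario above.
type arena struct {
	n   uint64 // offset accounting; the buggy version used a uint32
	buf []byte
}

var errArenaFull = errors.New("arena full")

// alloc reserves size bytes and returns the start offset of the reservation.
// The initial check stops failed allocations from building n up toward an
// overflow, and 64-bit arithmetic means it would take ~2^32 concurrent
// MaxUint32-sized requests to wrap the counter.
func (a *arena) alloc(size uint32) (uint32, error) {
	if atomic.LoadUint64(&a.n)+uint64(size) > uint64(len(a.buf)) {
		return 0, errArenaFull // size check before the atomic addition
	}
	newN := atomic.AddUint64(&a.n, uint64(size))
	if newN > uint64(len(a.buf)) {
		return 0, errArenaFull
	}
	return uint32(newN - uint64(size)), nil
}

func main() {
	a := &arena{buf: make([]byte, 1<<20)}
	off, err := a.alloc(64)
	fmt.Println(off, err) // 0 <nil>

	// With 32-bit accounting and no pre-check, repeated failed allocations
	// near math.MaxUint32 could wrap n, and the subtraction that computes
	// the offset could then underflow past the bounds of buf.
}
```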
@nvanbenschoten (Member) commented:
I have a suspected fix for this: andy-kimball/arenaskl#4.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 12, 2019
Fixes cockroachdb#31624.
Fixes cockroachdb#35557.

This commit picks up andy-kimball/arenaskl#4.

I strongly suspect that the uint32 overflow fixed in that PR was the
cause of the two index out of bounds panics. See that commit for more
details.

Release note: None
craig bot pushed a commit that referenced this issue Mar 12, 2019
35321: opt: propagate set operation output types to input columns r=rytaft a=rytaft

This commit updates the `optbuilder` logic for set operations in which
the types of the input columns do not match the types of the output
columns. This can happen if a column on one side has type Unknown,
but the corresponding column on the other side has a known type such
as Int. The known type must be propagated to the side with the unknown
type to prevent errors in the execution engine related to decoding
types.

If there are any column types on either side that don't match the output,
then the `optbuilder` propagates the output types of the set operation down
to the input columns by wrapping the side with mismatched types in a
Project operation. The Project operation passes through columns that
already have the correct type, and creates cast expressions for those
that don't.

Fixes #34524

Release note (bug fix): Fixed an error that happened when executing
some set operations containing only nulls in one of the input columns.
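The propagation described in the PR message above can be sketched in a few lines. The `colType` enum, `propagateTypes` helper, and its return values are hypothetical names for illustration, not the real `optbuilder` API; the sketch only shows the decision: a column whose input type differs from the set operation's output type gets the output type (conceptually via a projected cast), while matching columns pass through.

```go
package main

import "fmt"

// Hypothetical column types for illustration.
type colType string

const (
	typeUnknown colType = "unknown"
	typeInt     colType = "int"
)

// propagateTypes pushes the set operation's output types down to its input
// columns: mismatched columns (e.g. Unknown from an all-NULL side) take the
// output type and are flagged as needing a cast in a wrapping Project;
// columns that already match pass through unchanged.
func propagateTypes(input, output []colType) (result []colType, needsCast []bool) {
	result = make([]colType, len(input))
	needsCast = make([]bool, len(input))
	for i, t := range input {
		if t != output[i] {
			result[i] = output[i] // cast to the known output type
			needsCast[i] = true
		} else {
			result[i] = t // pass through as-is
		}
	}
	return result, needsCast
}

func main() {
	// One side of a UNION produced only NULLs (Unknown); the other side Int.
	input := []colType{typeUnknown, typeInt}
	output := []colType{typeInt, typeInt}
	res, casts := propagateTypes(input, output)
	fmt.Println(res, casts) // only the Unknown column needs a cast
}
```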

35587: opt: add more cost for lookup joins with more ON conditions r=RaduBerinde a=RaduBerinde

This is a very limited fix for #34810.

The core problem is that we don't take into account that if we have an
ON condition, not only is there a cost to evaluate it on each row, but
we also generate more internal rows to produce a given number of output
rows.

I attempted to do a more general fix (for all join types), where I
tried to estimate the "internal" number of rows using
`unknownFilterSelectivity` for each ON condition. There were two
problems:
 - in some cases (especially with lookup joins) we have an extra ON
   condition that doesn't actually do anything:
     `ab JOIN xy ON a=x AND a=10` becomes
     `ab JOIN xy ON a=x AND a=10 AND x=10`,
   and `a=10` could remain as an ON condition. This results in bad
   query plans in important cases (e.g. TPCC) where it prefers to do
   an extra lookup join (due to a non-covering index) just because of
   that condition.

 - we don't have the equality columns readily available for hash join
   (and didn't want to extract them each time we cost). In the future
   we may split the planning into a logical and physical stage, and we
   should then separate the logical joins from hash join.

For 19.1, we simply add a cost for lookup joins that is
proportional to the number of remaining ON conditions. This is the
least disruptive method that still fixes the case observed in #34810.
I will leave the issue open to address this properly in the next
release.

Note that although hash joins and merge joins have the same issue in
principle, in practice we always generate these expressions with
equality on all possible columns.

Release note: None
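The limited fix described above amounts to making the lookup-join cost grow with the number of leftover ON conditions. The function and constants below are an illustrative sketch, not CockroachDB's real coster; they only show how an extra per-condition term steers the optimizer away from plans that keep conditions like `a=10` as residual ON filters.

```go
package main

import "fmt"

// Hypothetical cost constants for illustration; the real coster's values
// and structure differ.
const (
	perRowLookupCost = 1.0
	perONCondCost    = 0.2
)

// lookupJoinCost charges a base per-row lookup cost plus an extra term
// proportional to the number of ON conditions remaining after equality
// columns are extracted, as described in the PR message above.
func lookupJoinCost(inputRows float64, remainingONConds int) float64 {
	return inputRows * (perRowLookupCost + float64(remainingONConds)*perONCondCost)
}

func main() {
	// The same join gets strictly more expensive as residual ON conditions
	// accumulate, so plans that avoid them are preferred.
	fmt.Println(lookupJoinCost(1000, 0)) // 1000
	fmt.Println(lookupJoinCost(1000, 2)) // 1400
}
```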

35630: storage/tscache: Pick up andy-kimball/arenaskl fix r=nvanbenschoten a=nvanbenschoten

Fixes #31624.
Fixes #35557.

This commit picks up andy-kimball/arenaskl#4.

I strongly suspect that the uint32 overflow fixed in that PR was the
cause of the two index out of bounds panics. See that commit for more
details.

The PR also fixes a bug in memory recycling within the tscache. I confirmed
on adriatic that over 900 64MB arenas had been allocated since it was last
wiped.

35644: opt: use correct ordering for insert input in execbuilder r=RaduBerinde a=RaduBerinde

We were setting up a projection on the Insert's input but we were
accidentally using the parent Insert's ordering instead of that of the
input.

Fixes #35564.

Release note (bug fix): Fixed a "column not in input" crash when
`INSERT ... RETURNING` is used inside a clause that requires an
ordering.

35651: jobs, sql, ui: Create `AutoCreateStats` job type r=celiala a=celiala

With #34279, enabling the cluster setting
`sql.stats.experimental_automatic_collection.enabled` has the potential
to create many CreateStats jobs, which can cause the Jobs view on the
AdminUI to become cluttered.

This commit creates a new `AutoCreateStats` job type for these auto-created
CreateStats jobs, so that users can still see their own manual runs of
CREATE STATISTICS via the pre-existing `CreateStats` type.

cc @danhhz, @piyush-singh, @rolandcrosby 

![jobs-auto-create-stats](https://user-images.githubusercontent.com/3051672/54212467-5cea2c80-44b9-11e9-9c11-db749814f019.gif)

Release note (admin ui change): AutoCreateStats type added to
Jobs page to filter automatic statistics jobs.

Fixes #34377.

Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Celia La <[email protected]>
@craig craig bot closed this as completed in #35630 Mar 12, 2019