Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: panic in backup resumer #52279

Closed
rohany opened this issue Aug 3, 2020 · 3 comments · Fixed by #52281
Closed

backupccl: panic in backup resumer #52279

rohany opened this issue Aug 3, 2020 · 3 comments · Fixed by #52281
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@rohany
Copy link
Contributor

rohany commented Aug 3, 2020

I'm on 66a3c19, and I've just started a local node and running the kv workload against it. Within a few minutes I see a panic:

* ERROR: [n1] a panic has occurred!
* runtime error: invalid memory address or nil pointer dereference
* (1) attached stack trace
*   | runtime.gopanic
*   | 	/usr/local/Cellar/go/1.14/libexec/src/runtime/panic.go:967
*   | runtime.panicmem
*   | 	/usr/local/Cellar/go/1.14/libexec/src/runtime/panic.go:212
*   | runtime.sigpanic
*   | 	/usr/local/Cellar/go/1.14/libexec/src/runtime/signal_unix.go:687
*   | github.com/cockroachdb/cockroach/pkg/storage/cloudimpl.newResumingHTTPReader
*   | 	/Users/rohany/go/src/github.com/cockroachdb/cockroach/pkg/storage/cloudimpl/http_storage.go:149
*   | github.com/cockroachdb/cockroach/pkg/storage/cloudimpl.(*httpStorage).ReadFile
*   | 	/Users/rohany/go/src/github.com/cockroachdb/cockroach/pkg/storage/cloudimpl/http_storage.go:254
*   | github.com/cockroachdb/cockroach/pkg/ccl/backupccl.readBackupManifest
*   | 	/Users/rohany/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/manifest_handling.go:155
*   | github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*backupResumer).Resume
*   | 	/Users/rohany/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backup_job.go:450
*   | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
*   | 	/Users/rohany/go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:850
*   | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resume.func1
*   | 	/Users/rohany/go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:986
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1
*   | 	/Users/rohany/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:347
*   | runtime.goexit
*   | 	/usr/local/Cellar/go/1.14/libexec/src/runtime/asm_amd64.s:1373
* Wraps: (2) runtime error: invalid memory address or nil pointer dereference
* Error types: (1) *withstack.withStack (2) runtime.errorString
@blathers-crl
Copy link

blathers-crl bot commented Aug 3, 2020

Hi @rohany, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@rohany rohany added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 3, 2020
craig bot pushed a commit that referenced this issue Aug 4, 2020
51503: rpc: use tenant client/server certs r=nvanbenschoten a=tbg


With this PR, tenant SQL servers use tenant client certificates to
connect to the tenant server endpoint at the KV layer. They were
previously using the KV-internal node certs.

No validation is performed as of this PR, but this is the obvious
next step.

Follow-up work will assertions that make sure that we don't see
tenants "accidentally" use the node certs for some operations
when they are available (as is typically the case during testing).

Finally, there will be some work on the heartbeats exchanged by
the RPC context. We don't want a SQL tenant's time signal to
ever trigger KV nodes to crash, for example.

Touches #47898.

Release note: None

- cli/flags,config: new flag for tenant KV listen addr
- sql: route tenant KV traffic to tenant KV address
- roachtest: configure --tenant-addr flag in acceptance/multitenant
- rpc: add TenantID to rpc.ContextOptions
- security: slight test improvement
- rpc: pass TenantID to SecurityContext to CertManager
- security: use a single test_certs dir
- security: embed certs for a few hard-coded tenants
- rpc: *almost* use tenant client certs (on tenants)
- rpc: use tenant client/server certs where appropriate


52281: bulkio: Correctly handle exhausting retries when reading from HTTP. r=rohany a=miretskiy

Fixes #52279

Return an error if we exhaust the retry budget when reading from HTTP.

Release Notes: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
@RaduBerinde
Copy link
Member

Also seeing the same crash in sqlsmith roachtests: https://teamcity.cockroachdb.com/viewLog.html?buildId=2143445

@rohany
Copy link
Contributor Author

rohany commented Aug 4, 2020

Yup, these have the same root cause. I was running into the panics because of pending SQLSmith backup jobs in my data directory

craig bot pushed a commit that referenced this issue Aug 13, 2020
52281: bulkio: Correctly handle exhausting retries when reading from HTTP. r=rohany a=miretskiy

Fixes #52279

Return an error if we exhaust the retry budget when reading from HTTP.

Release Notes: None

52696: cli/demo: activate Sentry crash reporting for `cockroach demo` r=irfansharif a=knz

Prior to this patch, the TestServer instance was using its own
`cluster.Settings` instance separate from the global one defined in
`cli.serverCfg`, which is the one used by the `log` package to decide
whether or not to send sentry reports.

Since the `Settings` objects did not match, sentry reporting never got
enabled for `cockroach demo`.

This patch fixes it.

This can be checked by running `cockroach demo --logtostderr=INFO`,
with `COCKROACH_CRASH_REPORTS` set to a suitable URL. The reporting
will be visible as the following log line (example):

```
E200812 16:27:10.520228 1764 util/log/crash_reporting.go:319  [n1,client=127.0.0.1:33741,hostssl,user=root] Queued as error 694cfa684ea14a528be7de9e79aa5947
```

Release note (cli change): Crashes in `cockroach demo` sessions are
now reported to telemetry, if telemetry is enabled.

52749: rowexec,distsql: fix a couple of things around aggregators r=yuzefovich a=yuzefovich

**rowexec: fix closing of aggregator bucket in case of an error**

Previously, the bucket of aggregate functions wouldn't get properly
closed if an aggregate function returned an error on `Result` call. Now
this is fixed.

Release note: None

**distsql: fix TestAggregatorAgainstProcessor**

In `TestAggregatorAgainstProcessor` we generate input types randomly
relying on `execinfrapb.GetAggregateInfo` function to perform the
"type-checking". That function, however, can have "false negative" for
several aggregate functions when a tuple is chosen as an input type, so
we have special handling for those cases.

Fixes: #52742.

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants