backupccl: nil pointer crash in storage.(*Writer).open during backup in 22.2.9 #103597
cc @cockroachdb/disaster-recovery
Some progress: I patched 22.2.9 so that we keep track of all open writers until they are closed and can confirm this appears to be a bug in our usage, not upstream:
So we're closing the sink, then closing the storage, but somehow closing the sink returned with one of the writers still unclosed. Going to add a bit more logging to see if I can figure out how.
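For illustration, a minimal sketch of the kind of tracking described in the debugging patch above; the names here (trackingSink, openWriters) are hypothetical and not the actual backupccl code:

```go
package sinktrack

import (
	"fmt"
	"io"
	"sync"
)

// trackingSink illustrates the debugging approach: every writer opened by
// the sink is remembered until it is closed, so that Close can report any
// writer that is still open.
type trackingSink struct {
	mu          sync.Mutex
	openWriters map[io.WriteCloser]string // writer -> destination name
}

// track records a newly opened writer under its destination name.
func (s *trackingSink) track(w io.WriteCloser, name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.openWriters == nil {
		s.openWriters = make(map[io.WriteCloser]string)
	}
	s.openWriters[w] = name
}

// untrack forgets a writer once it has been closed.
func (s *trackingSink) untrack(w io.WriteCloser) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.openWriters, w)
}

// Close logs any writer that was opened but never closed -- the condition
// observed in this issue when the sink was closed.
func (s *trackingSink) Close() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, name := range s.openWriters {
		fmt.Printf("sink closed with writer still open: %s\n", name)
	}
	return nil
}
```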
104187: backupccl: always track opened writer r=dt a=dt

Previously we would do things -- like wrap the opened writer with an encryption shim -- after opening the writer but before storing it in the sink. This could mean that if the sink opened a writer but failed to save it, and was then closed, it would fail to close that writer. This changes that, so that as soon as the writer is opened it is saved, and if it is later wrapped, the wrapped writer is saved again, to ensure that we cannot lose track of any successfully opened writer.

Fixes: #103597.

Release note: none.

Co-authored-by: David Taylor <[email protected]>
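A rough sketch of the ordering change the PR describes, using assumed names (fileSink, wrapWithEncryption) rather than the actual backupccl types:

```go
package backupsink

import (
	"context"
	"io"
)

// fileSink is a hypothetical stand-in for the backup file sink; store and
// enc are placeholders for the external storage handle and the encryption
// options.
type fileSink struct {
	store interface {
		Writer(ctx context.Context, name string) (io.WriteCloser, error)
	}
	enc interface{}
	out io.WriteCloser // most recently opened (possibly wrapped) writer
}

// wrapWithEncryption stands in for the encryption shim mentioned in the PR
// description.
func wrapWithEncryption(w io.WriteCloser, enc interface{}) (io.WriteCloser, error) {
	return w, nil
}

// openWriter sketches the ordering change: the writer is saved as soon as
// it is opened, and saved again after wrapping, so Close always has a
// handle to close. Before the fix, only the wrapped result was saved, so a
// failure between opening and saving leaked the writer.
func (s *fileSink) openWriter(ctx context.Context, name string) error {
	w, err := s.store.Writer(ctx, name)
	if err != nil {
		return err
	}
	s.out = w // save immediately so Close sees it even if wrapping fails

	if s.enc != nil {
		encW, err := wrapWithEncryption(w, s.enc)
		if err != nil {
			return err // s.out still points at w, so it will be closed
		}
		s.out = encW // save again so the wrapped writer is what gets closed
	}
	return nil
}

// Close closes whichever writer was most recently saved, if any.
func (s *fileSink) Close() error {
	if s.out != nil {
		return s.out.Close()
	}
	return nil
}
```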
During a roachtest (#103228), two nodes crashed while a backup was being taken, both due to a panic within the GCS library:
Every node in the 4-node cluster was running 22.2.9.
Stack traces for the two nodes that crashed (n3 and n4) are attached below. Note that a very similar crash had been reported before [1], and deemed fixed by [2]. However, the issue doesn't seem to be completely solved.
Reproduction
Running the backup-restore/mixed-version roachtest in #103228 with seed -4303022106448172299 seems to reproduce this with high probability (~1h30m after test start).
Attachments: n3_stacks.txt, n4_stacks.txt
Roachtest artifacts: https://console.cloud.google.com/storage/browser/cockroach-tmp/103597/roachtest_artifacts;tab=objects?project=cockroach-shared&prefix=&forceOnObjectsSortingFiltering=false&pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))
[1] googleapis/google-cloud-go#4167
[2] #65660
Jira issue: CRDB-28094