Failure to grant database privileges in CockroachDB on startup #1304

bnaecker · 2022-06-29T03:48:18Z

I saw an internal server error while installing the control plane this evening. I built and installed Omicron using omicron-package, building from a local commit merging #1298 with the current main, which is at 154a4a6. After a fresh reboot, I ran:

pfexec ./tools/create_virtual_hardware.sh
cargo build --bin omicron-package --release
./target/release/omicron-package package
pfexec ./target/release/omicron-package install

Looking at the sled agent logs, I saw:

[2022-06-29T03:33:29.08339844Z]  INFO: SledAgent/StorageManager/6346 on feldspar: CRDB is online
[2022-06-29T03:33:29.084297094Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Formatting CRDB
[2022-06-29T03:33:35.983551762Z]  INFO: SledAgent/StorageManager/6346 on feldspar: halt_and_remove_logged: Previous zone state: Running
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:35.986214134Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Stopped and uninstalled zone
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:35.993636362Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus about datasets, will retry in 327.794992ms
[2022-06-29T03:33:35.994011147Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus, will retry in 1.07985318s
[2022-06-29T03:33:35.994340186Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus about datasets, will retry in 471.061865ms
[2022-06-29T03:33:35.994666535Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus about datasets, will retry in 488.023497ms
[2022-06-29T03:33:35.995017057Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus, will retry in 925.598032ms
[2022-06-29T03:33:35.995311707Z]  INFO: SledAgent/dropshot (SledAgent)/6346 on feldspar: request completed (req_id=ed6c07d9-c91a-44ec-aa51-d7948d830391, uri=/filesystem, method=PUT, remote_addr=[fd00:1122:3344:101::1]:37061, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
    error_message_internal: Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbinit.sql] executed and failed with status: exit status: 1. Stdout: CREATE DATABASE
    CREATE ROLE
    GRANT
    CREATE TABLE
    CREATE TABLE
    CREATE INDEX
    CREATE TYPE
    CREATE TABLE
    CREATE INDEX
    CREATE TABLE
    CREATE TYPE
    CREATE TABLE
    CREATE INDEX
    CREATE TABLE
    CREATE INDEX
    CREATE TABLE
    CREATE TYPE
    CREATE TABLE
    CREATE INDEX
    CREATE TABLE
    CREATE INDEX
    CREATE TYPE
    CREATE TABLE
    CREATE INDEX
    CREATE INDEX
    CREATE TABLE
    , Stderr: NOTICE: GRANT INSERT, SELECT, UPDATE, DELETE ON DATABASE is deprecated.
    This statement was automatically converted to USE omicron; ALTER DEFAULT PRIVILEGES GRANT INSERT, SELECT, UPDATE, DELETE ON TABLES TO omicron;
    Please use ALTER DEFAULT PRIVILEGES going forward
    ERROR: driver: bad connection
    ERROR: driver: bad connection
    Failed running "sql"

[2022-06-29T03:33:35.996129256Z]  WARN: SledAgent/RSS/6346 on feldspar: failed to create filesystem (error="Error Response")
[2022-06-29T03:33:36.64487338Z]  INFO: SledAgent/RSS/6346 on feldspar: creating new filesystem: DatasetEnsureBody { id: 4d08fc19-3d5f-4f6b-9c48-925f8eac7255, zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221 }
[2022-06-29T03:33:36.647081411Z]  INFO: SledAgent/StorageManager/6346 on feldspar: add_dataset: NewFilesystemRequest { zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221, responder: Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) } }
[2022-06-29T03:33:36.647918134Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Ensuring dataset oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb exists
[2022-06-29T03:33:36.66757923Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Ensuring zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb is running
[2022-06-29T03:33:36.679000179Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb was not found
[2022-06-29T03:33:36.695774353Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Configuring new Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:36.715740835Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Installing Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:39.927704567Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Zone booting
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:41.969687215Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:101::2, prefix: 64 }))
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:42.536102317Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: Loading CRDB manifest
[2022-06-29T03:33:42.905934849Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: setting CRDB's config/listen_addr: [fd00:1122:3344:101::2]:32221
[2022-06-29T03:33:42.94067554Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: setting CRDB's config/store
[2022-06-29T03:33:42.971693725Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: setting CRDB's config/join_addrs
[2022-06-29T03:33:42.998374003Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: refreshing manifest
[2022-06-29T03:33:43.023893849Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: enabling CRDB service
[2022-06-29T03:33:43.047899489Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: awaiting liveness of CRDB
[2022-06-29T03:33:43.05261359Z]  WARN: SledAgent/6346 on feldspar: failed to contact nexus, will retry in 1.316873734s
[2022-06-29T03:33:43.052927673Z]  WARN: SledAgent/StorageManager/6346 on feldspar: cockroachdb not yet alive
[2022-06-29T03:33:43.251247719Z]  WARN: SledAgent/StorageManager/6346 on feldspar: cockroachdb not yet alive
[2022-06-29T03:33:43.520264543Z]  WARN: SledAgent/StorageManager/6346 on feldspar: cockroachdb not yet alive
[2022-06-29T03:33:44.370186624Z]  INFO: SledAgent/6346 on feldspar: contacting server nexus, registering sled: fb0f7546-4d46-40ca-9d56-cbb810684ca7
[2022-06-29T03:33:44.472806569Z]  WARN: SledAgent/StorageManager/6346 on feldspar: cockroachdb not yet alive
[2022-06-29T03:33:45.822259473Z]  INFO: SledAgent/StorageManager/6346 on feldspar: CRDB is online
[2022-06-29T03:33:45.823833209Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Formatting CRDB
[2022-06-29T03:33:58.271956658Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Formatting CRDB - Completed

The relevant bit there is the long list of database permissions. The sled agent starts up CRDB, then runs dbinit.sql to initialize the database and schema. That includes a statement granting permissions to operate on the tables in the database. That syntax appears outdated as of cockroachdb/cockroach#73065. The sled agent apparently gets a non-zero exit code from the subprocess running that SQL file, since it tears down the zone and tries again. The second attempt succeeds. I'm not sure why there would be a difference.

We should probably adopt the recommended action, which is to use ALTER DEFAULT PRIVILEGES instead.

smklein · 2022-06-29T12:29:29Z

A similar change broke the CLI github action, because CRDB was not pinned to a specific version: oxidecomputer/oxide.rs#203

While I think we can - and should - update the SQL statement, I don't think we updated the revision of CRDB being used by Omicron recently. Do you have any idea why this failure would occur, if we're still using a months-old version of cockroach?

smklein · 2022-06-29T12:39:00Z

FWIW, on Linux, I upgraded my rev of CRDB to 22.1.2 and saw this issue. But this required me to explicitly modify ./tools/cockroach_version and the corresponding checksum.

davepacheco · 2022-06-29T15:46:00Z

The error message implies that the syntax change was handled by Cockroach without issue. It looks to me like the real problem is:

    ERROR: driver: bad connection
    ERROR: driver: bad connection
    Failed running "sql"
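To make that distinction concrete, here is a minimal Python sketch that groups captured `cockroach sql` stderr by severity, so that a deprecation NOTICE is not mistaken for the cause of the non-zero exit status. It is not part of Omicron; the function name `split_cockroach_stderr` is hypothetical, and the sketch assumes only the `NOTICE:` and `ERROR:` prefixes visible in the log above.

```python
def split_cockroach_stderr(stderr: str) -> dict:
    """Group stderr lines under "NOTICE", "ERROR", or "OTHER".

    A line starting with "NOTICE:" or "ERROR:" opens that group. Untagged
    lines (such as the continuation lines of a NOTICE) are attached to the
    most recently opened group, or to "OTHER" if no tag has been seen yet.
    """
    groups = {"NOTICE": [], "ERROR": [], "OTHER": []}
    current = "OTHER"
    for raw in stderr.splitlines():
        line = raw.strip()
        if not line:
            continue
        for tag in ("NOTICE", "ERROR"):
            if line.startswith(tag + ":"):
                current = tag
                break
        groups[current].append(line)
    return groups


# The stderr reported in this issue, copied from the sled agent log.
SAMPLE_STDERR = """NOTICE: GRANT INSERT, SELECT, UPDATE, DELETE ON DATABASE is deprecated.
This statement was automatically converted to USE omicron; ALTER DEFAULT PRIVILEGES GRANT INSERT, SELECT, UPDATE, DELETE ON TABLES TO omicron;
Please use ALTER DEFAULT PRIVILEGES going forward
ERROR: driver: bad connection
ERROR: driver: bad connection
Failed running "sql"
"""

groups = split_cockroach_stderr(SAMPLE_STDERR)
```

With the stderr from this issue, the deprecation text lands under `NOTICE`, while the two `driver: bad connection` lines and the trailing `Failed running "sql"` land under `ERROR`.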

davepacheco · 2022-06-29T16:05:51Z

@bnaecker is it possible that cockroach died during this phase? You might check the stdout/stderr and the log files it writes to. Alternatively, you could look at the SMF log for CockroachDB to see if SMF restarted it due to processes exiting.

bnaecker · 2022-06-29T16:09:07Z

@davepacheco I noticed that re-reading the error message. It does seem like two distinct failures, since the GRANT usage is deprecated, not incorrect. I'm not sure the SMF log still exists, but I'll try to find it.

bnaecker · 2022-06-29T16:44:25Z

@davepacheco I'm actually not sure we have Cockroach's logfiles at this point. When sled-agent noticed this failure, it tore down the zone and tried again. I think that removes any extant state of the zone, including whatever log files Cockroach was writing, right?

davepacheco · 2022-06-29T20:09:53Z

Reopening this because I don't think #1305 fixed the real issue here.

davepacheco · 2022-06-29T20:12:53Z

> @davepacheco I'm actually not sure we have Cockroach's logfiles at this point. When sled-agent noticed this failure, it tore down the zone and tried again. I think that removes any extant state of the zone, including whatever log files Cockroach was writing, right?

If it destroys the ZFS dataset that was used for the CockroachDB files, then yes, it would have destroyed the evidence we'd want for further investigation here.

I'd suggest that if zone setup fails and we want to retry it (which I'm not sure is a good idea, but might be), we should at least archive the on-disk state so we can debug problems like this (which will of course happen in the field too). We may also want to save core files of any processes running in the zone or the ones of interest to us. @smklein I don't know enough about how this works today -- would that be a reasonable issue to file?
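As a rough illustration of the "archive before retry" idea, the Python sketch below copies a directory of on-disk state to a timestamped location before it would be torn down. This is only a sketch: the thread does not say how the sled agent would implement this (it is written in Rust, and a ZFS-level mechanism may be a better fit), and the function name `archive_before_retry` and both path arguments are hypothetical.

```python
import shutil
import tempfile
import time
from pathlib import Path


def archive_before_retry(state_dir: Path, archive_root: Path) -> Path:
    """Copy `state_dir` into a new, uniquely named directory under
    `archive_root` and return the path of the copy.

    The original directory is left untouched; the caller decides when to
    tear it down and retry.
    """
    archive_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    # mkdtemp guarantees a fresh directory even if two failures are
    # archived within the same second.
    dest = Path(tempfile.mkdtemp(prefix=f"{state_dir.name}-{stamp}-", dir=archive_root))
    shutil.copytree(state_dir, dest, dirs_exist_ok=True)
    return dest
```

A caller would invoke this on the failure path, before removing the zone, and log the returned path so the archived files can be found later.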

bnaecker added the database label (Related to database access) Jun 29, 2022

smklein self-assigned this Jun 29, 2022

smklein mentioned this issue Jun 29, 2022

[sql] Grant DB privileges in a non-deprecated way #1305

Merged

smklein closed this as completed in #1305 Jun 29, 2022

davepacheco reopened this Jun 29, 2022
