Failure to grant database privileges in CockroachDB on startup #1304

Open
bnaecker opened this issue Jun 29, 2022 · 8 comments · Fixed by #1305
Comments

@bnaecker
Collaborator

I saw an internal server error installing the control plane this evening. I built and installed Omicron using omicron-package, building from a local commit merging #1298 and the current main, which is at 154a4a6. From a fresh reboot I:

  • pfexec ./tools/create_virtual_hardware.sh
  • cargo build --bin omicron-package --release
  • ./target/release/omicron-package package
  • pfexec ./target/release/omicron-package install

Looking at the sled agent logs, I saw:

[2022-06-29T03:33:29.08339844Z]  INFO: SledAgent/StorageManager/6346 on feldspar: CRDB is online
[2022-06-29T03:33:29.084297094Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Formatting CRDB
[2022-06-29T03:33:35.983551762Z]  INFO: SledAgent/StorageManager/6346 on feldspar: halt_and_remove_logged: Previous zone state: Running
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:35.986214134Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Stopped and uninstalled zone
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:35.993636362Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus about datasets, will retry in 327.794992ms
[2022-06-29T03:33:35.994011147Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus, will retry in 1.07985318s
[2022-06-29T03:33:35.994340186Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus about datasets, will retry in 471.061865ms
[2022-06-29T03:33:35.994666535Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus about datasets, will retry in 488.023497ms
[2022-06-29T03:33:35.995017057Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus, will retry in 925.598032ms
[2022-06-29T03:33:35.995311707Z]  INFO: SledAgent/dropshot (SledAgent)/6346 on feldspar: request completed (req_id=ed6c07d9-c91a-44ec-aa51-d7948d830391, uri=/filesystem, method=PUT, remote_addr=[fd00:1122:3344:101::1]:37061, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
    error_message_internal: Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbinit.sql] executed and failed with status: exit status: 1. Stdout: CREATE DATABASE
    CREATE ROLE
    GRANT
    CREATE TABLE
    CREATE TABLE
    CREATE INDEX
    CREATE TYPE
    CREATE TABLE
    CREATE INDEX
    CREATE TABLE
    CREATE TYPE
    CREATE TABLE
    CREATE INDEX
    CREATE TABLE
    CREATE INDEX
    CREATE TABLE
    CREATE TYPE
    CREATE TABLE
    CREATE INDEX
    CREATE TABLE
    CREATE INDEX
    CREATE TYPE
    CREATE TABLE
    CREATE INDEX
    CREATE INDEX
    CREATE TABLE
    , Stderr: NOTICE: GRANT INSERT, SELECT, UPDATE, DELETE ON DATABASE is deprecated.
    This statement was automatically converted to USE omicron; ALTER DEFAULT PRIVILEGES GRANT INSERT, SELECT, UPDATE, DELETE ON TABLES TO omicron;
    Please use ALTER DEFAULT PRIVILEGES going forward
    ERROR: driver: bad connection
    ERROR: driver: bad connection
    Failed running "sql"

[2022-06-29T03:33:35.996129256Z]  WARN: SledAgent/RSS/6346 on feldspar: failed to create filesystem (error="Error Response")
[2022-06-29T03:33:36.64487338Z]  INFO: SledAgent/RSS/6346 on feldspar: creating new filesystem: DatasetEnsureBody { id: 4d08fc19-3d5f-4f6b-9c48-925f8eac7255, zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221 }
[2022-06-29T03:33:36.647081411Z]  INFO: SledAgent/StorageManager/6346 on feldspar: add_dataset: NewFilesystemRequest { zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221, responder: Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) } }
[2022-06-29T03:33:36.647918134Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Ensuring dataset oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb exists
[2022-06-29T03:33:36.66757923Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Ensuring zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb is running
[2022-06-29T03:33:36.679000179Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb was not found
[2022-06-29T03:33:36.695774353Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Configuring new Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:36.715740835Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Installing Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:39.927704567Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Zone booting
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:41.969687215Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:101::2, prefix: 64 }))
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:42.536102317Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: Loading CRDB manifest
[2022-06-29T03:33:42.905934849Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: setting CRDB's config/listen_addr: [fd00:1122:3344:101::2]:32221
[2022-06-29T03:33:42.94067554Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: setting CRDB's config/store
[2022-06-29T03:33:42.971693725Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: setting CRDB's config/join_addrs
[2022-06-29T03:33:42.998374003Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: refreshing manifest
[2022-06-29T03:33:43.023893849Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: enabling CRDB service
[2022-06-29T03:33:43.047899489Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: awaiting liveness of CRDB
[2022-06-29T03:33:43.05261359Z]  WARN: SledAgent/6346 on feldspar: failed to contact nexus, will retry in 1.316873734s
[2022-06-29T03:33:43.052927673Z]  WARN: SledAgent/StorageManager/6346 on feldspar: cockroachdb not yet alive
[2022-06-29T03:33:43.251247719Z]  WARN: SledAgent/StorageManager/6346 on feldspar: cockroachdb not yet alive
[2022-06-29T03:33:43.520264543Z]  WARN: SledAgent/StorageManager/6346 on feldspar: cockroachdb not yet alive
[2022-06-29T03:33:44.370186624Z]  INFO: SledAgent/6346 on feldspar: contacting server nexus, registering sled: fb0f7546-4d46-40ca-9d56-cbb810684ca7
[2022-06-29T03:33:44.472806569Z]  WARN: SledAgent/StorageManager/6346 on feldspar: cockroachdb not yet alive
[2022-06-29T03:33:45.822259473Z]  INFO: SledAgent/StorageManager/6346 on feldspar: CRDB is online
[2022-06-29T03:33:45.823833209Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Formatting CRDB
[2022-06-29T03:33:58.271956658Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Formatting CRDB - Completed

The relevant bit there is the long list of SQL statement results, followed by the deprecation notice. The sled agent starts up CRDB, then runs dbinit.sql to initialize the database and schema. That includes a statement granting permissions to operate on the tables in the database. That syntax appears to have been deprecated as of cockroachdb/cockroach#73065. The sled agent apparently gets a non-zero exit code from the subprocess running that SQL file, since it tears down the zone and tries again. The second attempt succeeds. I'm not sure why there would be a difference.

We should probably adopt the recommended action, which is to use ALTER DEFAULT PRIVILEGES instead.
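
A minimal sketch of the replacement, assuming it should mirror the statement CockroachDB printed in its NOTICE above. The helper name `default_privileges_sql` is hypothetical and exists only for illustration. The exact form that belongs in dbinit.sql should be checked against the CockroachDB documentation for the pinned version.

```python
def default_privileges_sql(database: str, role: str) -> str:
    """Build the statements that the NOTICE says the deprecated GRANT was converted to."""
    privileges = "INSERT, SELECT, UPDATE, DELETE"
    return (
        f"USE {database}; "
        f"ALTER DEFAULT PRIVILEGES GRANT {privileges} ON TABLES TO {role};"
    )


# Both the database and the role are named "omicron" in the log output above.
print(default_privileges_sql("omicron", "omicron"))
```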

@bnaecker bnaecker added the database Related to database access label Jun 29, 2022
@smklein smklein self-assigned this Jun 29, 2022
@smklein
Collaborator

smklein commented Jun 29, 2022

A similar change broke the CLI github action, because CRDB was not pinned to a specific version: oxidecomputer/oxide.rs#203

While I think we can - and should - update the SQL statement, I don't think we updated the revision of CRDB being used by Omicron recently. Do you have any idea why this failure would occur, if we're still using a months-old version of cockroach?

@smklein
Collaborator

smklein commented Jun 29, 2022

FWIW, on Linux, I upgraded my rev of CRDB to 22.1.2, and saw this issue. But this required me to explicitly modify ./tools/cockroach_version, and the corresponding checksum.

@davepacheco
Collaborator

The error message implies that the syntax change was handled by Cockroach without issue. It looks to me like the real problem is:

    ERROR: driver: bad connection
    ERROR: driver: bad connection
    Failed running "sql"
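
To make the distinction concrete, here is a rough sketch that separates the NOTICE lines (the deprecation warning) from the ERROR lines (the actual failure) in the captured stderr. The function name `split_stderr` is hypothetical, and the sample text is abbreviated from the log above. This is not how sled-agent handles the output today, as far as this thread shows.

```python
def split_stderr(stderr: str) -> tuple[list[str], list[str]]:
    """Separate NOTICE lines (warnings) from ERROR lines (failures)."""
    notices, errors = [], []
    for line in stderr.splitlines():
        text = line.strip()
        if text.startswith("NOTICE:"):
            notices.append(text)
        elif text.startswith("ERROR:"):
            errors.append(text)
    return notices, errors


# Abbreviated from the sled agent log in the issue description.
sample = """NOTICE: GRANT INSERT, SELECT, UPDATE, DELETE ON DATABASE is deprecated.
    ERROR: driver: bad connection
    ERROR: driver: bad connection
    Failed running "sql"
"""
notices, errors = split_stderr(sample)
print(len(notices), "notice(s);", len(errors), "error(s)")
```

Only the ERROR lines account for the non-zero exit status; the NOTICE by itself would not have failed the command.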

@davepacheco
Collaborator

@bnaecker is it possible that cockroach died during this phase? You might check the stdout/stderr and the log files it writes to. Alternatively, you could look at the SMF log for CockroachDB to see if SMF restarted it due to processes exiting.

@bnaecker
Collaborator Author

@davepacheco I noticed that when re-reading the error message. It does seem like two distinct failures, since the GRANT usage is deprecated, not incorrect. I'm not sure the SMF log still exists, but I'll try to find it.

@bnaecker
Collaborator Author

@davepacheco I'm actually not sure we have Cockroach's logfiles at this point. When sled-agent noticed this failure, it tore down the zone and tried again. I think that removes any extant state of the zone, including whatever log files Cockroach was writing, right?

@davepacheco
Collaborator

Reopening this because I don't think #1305 fixed the real issue here.

@davepacheco davepacheco reopened this Jun 29, 2022
@davepacheco
Collaborator

> @davepacheco I'm actually not sure we have Cockroach's logfiles at this point. When sled-agent noticed this failure, it tore down the zone and tried again. I think that removes any extant state of the zone, including whatever log files Cockroach was writing, right?

If it destroys the ZFS dataset that was used for the CockroachDB files, then yes, it would have destroyed the evidence we'd want for further investigation here.

I'd suggest that if zone setup fails and we want to retry it (which I'm not sure is a good idea, but might be), we should at least archive the on-disk state so we can debug problems like this (which will of course happen in the field too). We may also want to save core files of any processes running in the zone or the ones of interest to us. @smklein I don't know enough about how this works today -- would that be a reasonable issue to file?
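
As a rough illustration of the archiving idea, the sketch below copies a failed zone's state directory aside before anything is destroyed. Everything here is an assumption made for the example: the function name `archive_zone_state`, the directory layout, and the use of a plain recursive copy. It does not reflect how sled-agent manages zones or datasets.

```python
import shutil
import tempfile
import time
from pathlib import Path


def archive_zone_state(state_dir: Path, archive_root: Path) -> Path:
    """Copy a failed zone's on-disk state aside before it is torn down."""
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    dest = archive_root / f"{state_dir.name}-{stamp}"
    # copytree creates dest (and any missing parents) and fails if dest exists.
    shutil.copytree(state_dir, dest)
    return dest


# Demonstration with a throwaway directory standing in for the zone's dataset.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "cockroachdb"
    (state / "logs").mkdir(parents=True)
    (state / "logs" / "cockroach.log").write_text("example log line\n")
    saved = archive_zone_state(state, Path(tmp) / "archive")
    print(sorted(p.name for p in (saved / "logs").iterdir()))
```

On a real system a snapshot or rename of the dataset might be cheaper than a copy. The point is only that the retry path preserves the evidence first.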
