Nexus crashed upon hitting CRDB connection reset by peer [exhausted] error #5026

Open
askfongjojo opened this issue Feb 8, 2024 · 7 comments
@askfongjojo

05:51:33.436Z ERRO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): delete_crucible_region: region_get saw Error Response: status: 404 Not Found; headers: {"content-type": "application/json", "x-request-id": "7a3a9fb8-10c9-47d6-b878-edb41d8cf9d2", "content-length": "84", "date": "Thu, 08 Feb 2024 05:51:33 GMT"}; value: Error { error_code: None, message: "Not Found", request_id: "7a3a9fb8-10c9-47d6-b878-edb41d8cf9d2" }
    file = nexus/src/app/sagas/common_storage.rs:162
    region_id = 1936ad0b-9d9a-4682-97c2-c64f1594b216
    saga_type = recovery
thread 'tokio-runtime-worker' panicked at nexus/db-queries/src/db/sec_store.rs:65:60:
called `Result::unwrap()` on an `Err` value: InternalError { internal_message: "database error (kind = Unknown): result is ambiguous: error=rpc error: code = Unavailable desc = error reading from server: read tcp [fd00:1122:3344:109::3]:49862->[fd00:1122:3344:105::3]:32221: read: connection reset by peer [exhausted]\n" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Feb  8 05:52:04 Stopping because all processes in service exited. ]
[ Feb  8 05:52:04 Executing stop method (:kill). ]

This started happening after rack2 was updated to omicron commit e88e06bf8e5eed16b42ac78ca536b06fdd0dc183. I was using terraform to create a bunch of disks; many started failing and attempted to unwind. I'll include a link to the complete nexus logs shortly, and I'm also tracking down what the first disk_create failure was complaining about.
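For context on the panic itself: the backtrace points at an unwrap() of a database write result in Nexus's saga log sink (sec_store.rs:65). A minimal Rust sketch of that failure mode, using a hypothetical write_saga_event helper rather than the actual omicron code:

// Minimal sketch of the crash path (hypothetical names, not the omicron API).
// A transient CockroachDB failure ("connection reset by peer", ambiguous
// result) surfaces as an Err, and the unwrap() escalates it into a panic
// that takes down the whole Nexus process.
#[derive(Debug)]
struct InternalError {
    internal_message: String,
}

fn write_saga_event(event: &str) -> Result<(), InternalError> {
    // Stand-in for the real database write; simulate a transient error.
    Err(InternalError {
        internal_message: format!(
            "database error: result is ambiguous while recording {event}"
        ),
    })
}

fn record_event(event: &str) {
    // Mirrors the pattern at sec_store.rs:65:60 — any Err here panics the
    // tokio worker thread and the process exits.
    write_saga_event(event).unwrap();
}

fn main() {
    record_event("disk-create unwind step");
}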

@askfongjojo askfongjojo added this to the 6 milestone Feb 8, 2024
@askfongjojo
Author

askfongjojo commented Feb 8, 2024

Looks like a side effect of #5022. The complete nexus log can be found on catacomb - /staff/dock/rack2/omicron-5026/nexus.log.

@askfongjojo askfongjojo changed the title Nexus crashed upon hitting 404 error during delete_crucible_region: region_get Nexus crashed upon hitting CRDB connection reset by peer [exhausted] error Feb 8, 2024
@askfongjojo
Author

askfongjojo commented Feb 8, 2024

On the CockroachDB node on sled 9 (which corresponds to the endpoint [fd00:1122:3344:105::3]:32221 in the nexus error message), I saw these log lines in cockroach.log about 10 seconds before nexus logged the database error:

root@oxz_cockroachdb_4c3ef132:/data/logs# less cockroach.oxzcockroachdb4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.root.2024-02-08T05_44_53Z.012360.log
...
W240208 05:51:23.117813 1624 server/node.go:894 ⋮ [n2,summaries] 1067  health alerts detected: {Alerts:[{StoreID:0 Category:‹METRICS› Description:‹round-trip-latency-p90› Value:1.342177279e+09}]}
W240208 05:51:23.365682 19747 kv/kvserver/raft_transport.go:674 ⋮ [n2] 1068  while processing outgoing Raft queue to node 5: ‹EOF›:
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069  ‹[core]›‹[Channel #282 SubChannel #283] grpc: addrConn.createTransport failed to connect to {›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "Addr": "[fd00:1122:3344:108::3]:32221",›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "ServerName": "[fd00:1122:3344:108::3]:32221",›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "Attributes": null,›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "BalancerAttributes": null,›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "Type": 0,›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "Metadata": null›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹}. Err: connection error: desc = "transport: Error while dialing cannot reuse client connection"›
I240208 05:51:24.105617 754 1@gossip/gossip.go:1547 ⋮ [n2] 1070  node has connected to cluster via gossip
W240208 05:51:24.227311 22840 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n2] 1071  closed timestamps side-transport connection dropped from node: 3
W240208 05:51:24.341680 23133 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n2] 1072  closed timestamps side-transport connection dropped from node: 1
W240208 05:51:25.085137 19055 2@rpc/clock_offset.go:216 ⋮ [n2,rnode=5,raddr=‹[fd00:1122:3344:108::3]:32221›,class=system,heartbeat] 1073  latency jump (prev avg 669.83ms, current 5950.24ms)

The complete CRDB log can be found in catacomb, /staff/dock/rack2/omicron-5026/cockroachdb.log.

@askfongjojo
Author

I checked the four most recent nexus log files to see what precedes the database connection reset errors. The preceding activity is not consistent: there are region deletes, background NAT RPW tasks, and blueprint planner tasks. At this point it is unclear whether disk create or #5022 is causing the issue; the failed disk creates and unwinds may be symptoms rather than causes. @mkeeter and @lefttwo saw multiple timeouts between the upstairs and downstairs, so we may be dealing with a rack-level networking issue. More debugging is needed.

@davepacheco
Collaborator

Given that Nexus lost its connection to CockroachDB, the crash itself is #2416. The question is why that connection was lost.
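For illustration only (this is not the actual fix, which is what #2416 tracks), one shape the mitigation could take is to classify transient or ambiguous CockroachDB errors and retry the write instead of unwrapping. A rough sketch with a hypothetical is_retryable classifier:

use std::thread::sleep;
use std::time::Duration;

#[derive(Debug)]
struct InternalError {
    internal_message: String,
}

// Hypothetical classifier: treat connection resets and ambiguous results as
// transient. A real fix must be careful with ambiguous results, since the
// statement may or may not have committed.
fn is_retryable(err: &InternalError) -> bool {
    err.internal_message.contains("connection reset by peer")
        || err.internal_message.contains("result is ambiguous")
}

fn record_event_with_retry<F>(mut write: F, max_attempts: u32) -> Result<(), InternalError>
where
    F: FnMut() -> Result<(), InternalError>,
{
    let mut attempt = 0;
    loop {
        match write() {
            Ok(()) => return Ok(()),
            Err(e) if is_retryable(&e) && attempt + 1 < max_attempts => {
                attempt += 1;
                // Linear backoff keeps the sketch short; exponential backoff
                // with jitter would be more appropriate in practice.
                sleep(Duration::from_millis(100 * u64::from(attempt)));
            }
            Err(e) => return Err(e),
        }
    }
}

Whether a write that failed with an ambiguous result can safely be retried depends on the statement being idempotent, which is part of what makes a general fix here nontrivial.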

@augustuswm
Contributor

Checked the nexus core files that we collected from dogfood and confirmed that they each panicked at the point referenced above (sec_store.rs:65).

@morlandi7

Notes:

  • This was cleared by a restart, and we don't believe it is a regression (it was not a ship blocker)
  • A follow-up item is to consider a longer-term effort to replace SDE
  • An issue was opened with Intel

@morlandi7 morlandi7 assigned morlandi7 and Nieuwejaar and unassigned morlandi7 Mar 5, 2024
@morlandi7 morlandi7 modified the milestones: 6, 7 Mar 5, 2024
@Nieuwejaar
Contributor

The network failure is being tracked as a dendrite issue: https://github.com/oxidecomputer/dendrite/issues/846

@morlandi7 morlandi7 modified the milestones: 7, MVP Mar 12, 2024