Nexus crashed upon hitting CRDB connection reset by peer [exhausted] error #5026

Open
askfongjojo opened this issue Feb 8, 2024 · 7 comments
@askfongjojo

05:51:33.436Z ERRO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): delete_crucible_region: region_get saw Error Response: status: 404 Not Found; headers: {"content-type": "application/json", "x-request-id": "7a3a9fb8-10c9-47d6-b878-edb41d8cf9d2", "content-length": "84", "date": "Thu, 08 Feb 2024 05:51:33 GMT"}; value: Error { error_code: None, message: "Not Found", request_id: "7a3a9fb8-10c9-47d6-b878-edb41d8cf9d2" }
    file = nexus/src/app/sagas/common_storage.rs:162
    region_id = 1936ad0b-9d9a-4682-97c2-c64f1594b216
    saga_type = recovery
thread 'tokio-runtime-worker' panicked at nexus/db-queries/src/db/sec_store.rs:65:60:
called `Result::unwrap()` on an `Err` value: InternalError { internal_message: "database error (kind = Unknown): result is ambiguous: error=rpc error: code = Unavailable desc = error reading from server: read tcp [fd00:1122:3344:109::3]:49862->[fd00:1122:3344:105::3]:32221: read: connection reset by peer [exhausted]\n" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Feb  8 05:52:04 Stopping because all processes in service exited. ]
[ Feb  8 05:52:04 Executing stop method (:kill). ]

This started happening after rack2 was updated to omicron commit e88e06bf8e5eed16b42ac78ca536b06fdd0dc183. I was using terraform to create a bunch of disks; many started failing and attempted to unwind. I'll include a link to the complete nexus logs shortly, and I'm also tracking down what the first disk_create failure was complaining about.
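For context on the panic itself: the backtrace points at an unwrap() of a database write result in Nexus's saga log sink (sec_store.rs:65). A minimal Rust sketch of that failure mode, using a hypothetical write_saga_event helper rather than the actual omicron code:

// Minimal sketch of the crash path (hypothetical names, not the omicron API).
// A transient CockroachDB failure ("connection reset by peer", ambiguous
// result) surfaces as an Err, and the unwrap() escalates it into a panic
// that takes down the whole Nexus process.
#[derive(Debug)]
struct InternalError {
    internal_message: String,
}

fn write_saga_event(event: &str) -> Result<(), InternalError> {
    // Stand-in for the real database write; simulate a transient error.
    Err(InternalError {
        internal_message: format!(
            "database error: result is ambiguous while recording {event}"
        ),
    })
}

fn record_event(event: &str) {
    // Mirrors the pattern at sec_store.rs:65:60 — any Err here panics the
    // tokio worker thread and the process exits.
    write_saga_event(event).unwrap();
}

fn main() {
    record_event("disk-create unwind step");
}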

@askfongjojo askfongjojo added this to the 6 milestone Feb 8, 2024
@askfongjojo
Author

askfongjojo commented Feb 8, 2024

Looks like a side effect of #5022. The complete nexus log can be found on catacomb - /staff/dock/rack2/omicron-5026/nexus.log.

@askfongjojo askfongjojo changed the title Nexus crashed upon hitting 404 error during delete_crucible_region: region_get Nexus crashed upon hitting CRDB connection reset by peer [exhausted] error Feb 8, 2024
@askfongjojo
Author

askfongjojo commented Feb 8, 2024

On the CockroachDB node on sled 9 (which corresponds to the endpoint [fd00:1122:3344:105::3]:32221 in the nexus error message), I saw these log lines in cockroach.log about 10 seconds before nexus logged the database error:

root@oxz_cockroachdb_4c3ef132:/data/logs# less cockroach.oxzcockroachdb4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.root.2024-02-08T05_44_53Z.012360.log
...
W240208 05:51:23.117813 1624 server/node.go:894 ⋮ [n2,summaries] 1067  health alerts detected: {Alerts:[{StoreID:0 Category:‹METRICS› Description:‹round-trip-latency-p90› Value:1.342177279e+09}]}
W240208 05:51:23.365682 19747 kv/kvserver/raft_transport.go:674 ⋮ [n2] 1068  while processing outgoing Raft queue to node 5: ‹EOF›:
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069  ‹[core]›‹[Channel #282 SubChannel #283] grpc: addrConn.createTransport failed to connect to {›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "Addr": "[fd00:1122:3344:108::3]:32221",›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "ServerName": "[fd00:1122:3344:108::3]:32221",›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "Attributes": null,›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "BalancerAttributes": null,›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "Type": 0,›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "Metadata": null›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹}. Err: connection error: desc = "transport: Error while dialing cannot reuse client connection"›
I240208 05:51:24.105617 754 1@gossip/gossip.go:1547 ⋮ [n2] 1070  node has connected to cluster via gossip
W240208 05:51:24.227311 22840 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n2] 1071  closed timestamps side-transport connection dropped from node: 3
W240208 05:51:24.341680 23133 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n2] 1072  closed timestamps side-transport connection dropped from node: 1
W240208 05:51:25.085137 19055 2@rpc/clock_offset.go:216 ⋮ [n2,rnode=5,raddr=‹[fd00:1122:3344:108::3]:32221›,class=system,heartbeat] 1073  latency jump (prev avg 669.83ms, current 5950.24ms)

The complete CRDB log can be found in catacomb, /staff/dock/rack2/omicron-5026/cockroachdb.log.

@askfongjojo
Author

I checked the four most recent nexus log files to see what precedes the database connection reset errors. The preceding activity is not consistent: there are region deletes, background NAT RPW tasks, and blueprint planner tasks. At this point it is unclear whether disk create or #5022 is causing the issue; the failed disk creates and unwinds may be symptoms rather than causes. @mkeeter and @lefttwo saw multiple timeouts between the upstairs and downstairs, so we may be dealing with a rack-level networking issue. More debugging is needed.

@davepacheco
Collaborator

Given that Nexus lost its connection to CockroachDB, the crash itself is #2416. The question is why that connection was lost.
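For illustration only (this is not the actual fix, which is what #2416 tracks), one shape the mitigation could take is to classify transient or ambiguous CockroachDB errors and retry the write instead of unwrapping. A rough sketch with a hypothetical is_retryable classifier:

use std::thread::sleep;
use std::time::Duration;

#[derive(Debug)]
struct InternalError {
    internal_message: String,
}

// Hypothetical classifier: treat connection resets and ambiguous results as
// transient. A real fix must be careful with ambiguous results, since the
// statement may or may not have committed.
fn is_retryable(err: &InternalError) -> bool {
    err.internal_message.contains("connection reset by peer")
        || err.internal_message.contains("result is ambiguous")
}

fn record_event_with_retry<F>(mut write: F, max_attempts: u32) -> Result<(), InternalError>
where
    F: FnMut() -> Result<(), InternalError>,
{
    let mut attempt = 0;
    loop {
        match write() {
            Ok(()) => return Ok(()),
            Err(e) if is_retryable(&e) && attempt + 1 < max_attempts => {
                attempt += 1;
                // Linear backoff keeps the sketch short; exponential backoff
                // with jitter would be more appropriate in practice.
                sleep(Duration::from_millis(100 * u64::from(attempt)));
            }
            Err(e) => return Err(e),
        }
    }
}

Whether a write that failed with an ambiguous result can safely be retried depends on the statement being idempotent, which is part of what makes a general fix here nontrivial.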

@augustuswm
Contributor

Checked the nexus core files that we collected from dogfood and confirmed that they each panicked at the point referenced above (sec_store.rs:65).

@morlandi7

Notes:

  • This was cleared by a restart, and we don't believe it is a regression (it was not a ship blocker)
  • A follow-up item is to consider a longer-term effort to replace SDE
  • An issue was opened with Intel

@morlandi7 morlandi7 assigned morlandi7 and Nieuwejaar and unassigned morlandi7 Mar 5, 2024
@morlandi7 morlandi7 modified the milestones: 6, 7 Mar 5, 2024
@Nieuwejaar
Contributor

The network failure is being tracked as a dendrite issue: https://github.com/oxidecomputer/dendrite/issues/846

@morlandi7 morlandi7 modified the milestones: 7, MVP Mar 12, 2024