Nexus crashed upon hitting CRDB connection reset by peer [exhausted] error #5026
Comments
Looks like a side effect of #5022. The complete nexus log can be found on catacomb at /staff/dock/rack2/omicron-5026/nexus.log.
delete_crucible_region: region_get connection reset by peer [exhausted] error
On the CRDB in sled 9 (which corresponds to the endpoint [fd00:1122:3344:105::3]:32221 in the nexus error message), I saw these log lines in cockroach.log about 10 seconds before nexus logged the database error:
The complete CRDB log can be found on catacomb at /staff/dock/rack2/omicron-5026/cockroachdb.log.
I checked the 4 most recent nexus log files to see what precedes the database connection reset errors. The preceding activity is not consistent: it includes region deletes, background nat rpw tasks, and blueprint planner tasks. At this point, it is unclear whether disk create or #5022 is causing the issue. The failed disk creates and unwinds may be symptoms rather than causes. @mkeeter and @lefttwo saw multiple timeouts between the upstairs and downstairs, so we may be dealing with a rack-level networking issue. More debugging is still needed.
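For anyone repeating the check above on other log files, here is a minimal sketch of how the lines preceding each connection reset error could be pulled out. It treats the log as plain text and matches on the literal string "connection reset by peer"; the function name `context_before_errors` and the window size are hypothetical choices for illustration, not part of any omicron tooling.

```python
from collections import deque
from typing import Iterable, Iterator, List

# Substring taken from the error text quoted in this issue.
ERROR_MARKER = "connection reset by peer"


def context_before_errors(
    lines: Iterable[str],
    marker: str = ERROR_MARKER,
    before: int = 5,
) -> Iterator[List[str]]:
    """For each line containing `marker`, yield the preceding lines
    (at most `before` of them) followed by the matching line itself."""
    window = deque(maxlen=before)  # rolling buffer of the most recent lines
    for raw in lines:
        line = raw.rstrip("\n")
        if marker in line:
            yield list(window) + [line]
        window.append(line)


if __name__ == "__main__":
    # Hypothetical sample lines, not real nexus log output.
    sample = [
        "task A started",
        "task B started",
        "region_get: connection reset by peer [exhausted]",
    ]
    for block in context_before_errors(sample, before=2):
        print("\n".join(block))
        print("----")
```

To run it against a real file, pass an open file handle instead of the sample list, for example `context_before_errors(open(path), before=20)`.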
Given that Nexus lost its connection to CockroachDB, the crash is #2416. The question is why that connection was lost.
Checked the nexus core files that we collected from dogfood and confirmed that each of them panicked at the point linked above.
Notes:
The network failure is being tracked as a dendrite issue: https://github.com/oxidecomputer/dendrite/issues/846
This started happening after rack2 was updated to omicron commit e88e06bf8e5eed16b42ac78ca536b06fdd0dc183. I was using terraform to create a bunch of disks, and many started failing and attempted to unwind. I'll include the link to the complete nexus logs in a bit, and I am also tracking down what the first disk_create failure was complaining about.