[DO NOT MERGE] Repro bb8 issues #5351

Draft
wants to merge 1 commit into
base: main

smklein
Collaborator

@smklein smklein commented Mar 29, 2024

As part of #5172, I hit a bug during RSS handoff.

In particular:

  • Sled Agent called "initialization-completed"
  • Nexus started a transaction to update a bunch of data
  • Nexus incorrectly sent a request on a "new connection", rather than the "transaction connection"

Expected behavior:

  • Initialization succeeds, but without safe transaction semantics? Or perhaps bb8 complains that it has run out of connections, if it's blocking anywhere? I'm not totally sure I understand the semantics yet.

Observed behavior:

  • Several connections -- even ones from unrelated background jobs -- started returning "Timed out in bb8" errors when attempting to access new connections. Furthermore, the "initialization-completed" request appeared to hang indefinitely.
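To make the "Timed out in bb8" symptom concrete, here is a minimal sketch of one way a nested checkout can end in a pool timeout. It is not a diagnosis of this bug. `ToyPool`, `ToyConn`, `TimedOut`, and `transaction_with_nested_checkout` are hypothetical names invented for illustration, built on the standard library only; none of this is bb8's API. The sketch assumes a fixed number of slots and a bounded wait for a free one.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::{Duration, Instant};

/// Toy stand-in for a connection pool (hypothetical, NOT bb8's API).
/// It models only "a fixed number of slots and a bounded wait for one".
#[derive(Clone)]
struct ToyPool {
    // The counter is the number of currently free slots.
    inner: Arc<(Mutex<usize>, Condvar)>,
}

/// Error returned when no slot became free before the deadline.
#[derive(Debug, PartialEq)]
struct TimedOut;

/// A checked-out slot. Dropping it returns the slot to the pool.
struct ToyConn {
    pool: ToyPool,
}

impl ToyPool {
    fn new(size: usize) -> Self {
        ToyPool { inner: Arc::new((Mutex::new(size), Condvar::new())) }
    }

    /// Wait up to `timeout` for a free slot.
    fn get(&self, timeout: Duration) -> Result<ToyConn, TimedOut> {
        let (lock, cvar) = &*self.inner;
        let deadline = Instant::now() + timeout;
        let mut free = lock.lock().unwrap();
        while *free == 0 {
            let remaining = deadline.saturating_duration_since(Instant::now());
            if remaining.is_zero() {
                return Err(TimedOut);
            }
            free = cvar.wait_timeout(free, remaining).unwrap().0;
        }
        *free -= 1;
        Ok(ToyConn { pool: self.clone() })
    }
}

impl Drop for ToyConn {
    fn drop(&mut self) {
        let (lock, cvar) = &*self.pool.inner;
        *lock.lock().unwrap() += 1;
        cvar.notify_one();
    }
}

/// Hypothetical "transaction" that holds one slot and, while still holding
/// it, asks the pool for a second one instead of reusing the first.
fn transaction_with_nested_checkout(
    pool: &ToyPool,
    timeout: Duration,
) -> Result<(), TimedOut> {
    let _txn_conn = pool.get(timeout)?;
    // The nested request can only succeed if a spare slot exists.
    let _side_conn = pool.get(timeout)?;
    Ok(())
}
```

With a spare slot (`ToyPool::new(2)`) the nested checkout succeeds. With `ToyPool::new(1)` it returns `Err(TimedOut)`, because the only slot is held by the caller that is waiting. In this toy model the pattern only fails once the pool is saturated, so a small isolated test can pass while a busy system times out.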

This PR attempts to act as a reproduction case for that class of issues.

I'm also working on adding a reproduction case to https://github.com/oxidecomputer/async-bb8-diesel, but I haven't managed that quite yet.

Collaborator Author

smklein commented Mar 29, 2024

If this reproduces, I think I'm going to take the following steps:

  • Try to create a smaller reproduction. This is a big beefy transaction that needs a lot of stars to align before it works. It would be nice if we could re-create this with a smaller endpoint, or ideally, without a dropshot endpoint at all.
  • Add a lot more inspection in bb8. What is the status of our transaction pool? How many connections do we think are open?
  • Add more inspection on the Cockroach side. Is there any way to see which transactions are still open?
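As a rough idea of what the extra pool inspection could look like, the sketch below wraps an arbitrary "get a connection" call with two counters: how many callers are currently waiting, and how many connections are currently checked out. `PoolGauge`, `CheckedOut`, `tracked_get`, and `snapshot` are hypothetical helpers written against the standard library only. They are not part of bb8 or async-bb8-diesel, and a real version would need to hook into wherever Nexus actually acquires connections.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical counters for pool inspection (not part of bb8).
#[derive(Default)]
struct PoolGauge {
    /// Callers currently inside a "get a connection" call.
    waiting: AtomicUsize,
    /// Connections currently handed out and not yet dropped.
    checked_out: AtomicUsize,
}

/// RAII marker tied to a handed-out connection; decrements on drop.
struct CheckedOut(Arc<PoolGauge>);

impl Drop for CheckedOut {
    fn drop(&mut self) {
        self.0.checked_out.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Run `get` (any closure that tries to acquire a connection) and record
/// the wait and the checkout around it. The caller keeps the returned
/// `CheckedOut` marker alive for as long as it holds the connection.
fn tracked_get<C, E>(
    gauge: &Arc<PoolGauge>,
    get: impl FnOnce() -> Result<C, E>,
) -> Result<(C, CheckedOut), E> {
    gauge.waiting.fetch_add(1, Ordering::SeqCst);
    let result = get();
    gauge.waiting.fetch_sub(1, Ordering::SeqCst);
    let conn = result?;
    gauge.checked_out.fetch_add(1, Ordering::SeqCst);
    Ok((conn, CheckedOut(Arc::clone(gauge))))
}

/// Read both counters as `(waiting, checked_out)`, e.g. for a log line.
fn snapshot(gauge: &PoolGauge) -> (usize, usize) {
    (
        gauge.waiting.load(Ordering::SeqCst),
        gauge.checked_out.load(Ordering::SeqCst),
    )
}
```

Logging `snapshot` periodically, or whenever a checkout fails, would show whether the checked-out count sits at the pool's maximum while the waiting count climbs. That is the pattern one would expect if connections are being held rather than lost.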

I have theories about how things could be going wrong, but need more data to validate.

  • Is this triggered purely by "getting a new connection from within a transaction, which already has a transaction"? -> I don't think so. This doesn't reproduce minimally (I tried, in async-bb8-diesel), and I think it is already happening implicitly with all the auth check calls that aren't using the same connection as the higher-level transaction.
  • Could there be a deadlock here between transactions? -> This seems possible to me? Perhaps the transaction touches rows that are getting poked at by the "independently-checked out connection", and the inability of the transaction to complete hangs both?
  • Could local state in Diesel (or bb8) be corrupted? -> This is possible, but I believe the "Transaction manager" semantics are "per-connection", so this seems unlikely to me?
