Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[10.x] Detect MySQL read-only mode error as a lost connection #48517

Closed

Conversation

cosmastech
Copy link
Contributor

@cosmastech cosmastech commented Sep 24, 2023

To resolve #48486. The case described in the issue is using database for the queue connection and AWS Aurora failing over during the queue worker run.

If database is in read-only mode, then queue worker cannot update tables as needed and should kill the worker.

I admit that I do not know what consequences there might be for including this in the DetectsLostConnections trait, as it is employed by several other classes... seems to me if the PDO is throwing an exception and that exception is reporting a read-only error, it's probably safe to call it a lost connection, but I'll admit I'm looking at this through a pretty myopic lens.

@cosmastech cosmastech force-pushed the feature/read-only-mode-failure branch from a88ac20 to 326dc6f Compare September 24, 2023 11:15
@cosmastech cosmastech marked this pull request as ready for review September 24, 2023 11:15
@taylorotwell
Copy link
Member

So... does this fix the issue? Have you tested it in a real scenario?

@taylorotwell taylorotwell marked this pull request as draft September 25, 2023 00:42
@cosmastech
Copy link
Contributor Author

So... does this fix the issue? Have you tested it in a real scenario?

I ran a test scenario, not against a master failover.

Create a job that sets the database to read-only. The worker dies with the PR change. Without it, the queue worker will keep running but will never be able to advance.

@GrahamCampbell
Copy link
Member

There is some risk here that this will result in thrashing connections to people's databases if the writer is in read-only mode or we are connected to a reader, and re-connecting doesn't change that.

@cosmastech
Copy link
Contributor Author

There is some risk here that this will result in thrashing connections to people's databases if the writer is in read-only mode or we are connected to a reader, and re-connecting doesn't change that.

Any idea what might be a better solution? We could create a new trait DetectsReadOnlyMode and employ it on the worker (so that it's not thrashing when calling DetectsLostConnections@causedByLostConnection() from within Connector or ManagesTransactions)

That said, from a quick scan at the current errors checked within DetectLostConnections, isn't there a chance of thrashing as described already?

'running with the --read-only option so it cannot execute this statement' and 'Reason: Server is in script upgrade mode. Only administrator can connect at this time.'

@peterlupu
Copy link
Contributor

Just from the top of my head, what about on the first error encountered we reconnect and if we get the same read-only error, we fall back to current behavior of throwing the exception and failing?

@cosmastech
Copy link
Contributor Author

Just from the top of my head, what about on the first error encountered we reconnect and if we get the same read-only error, we fall back to current behavior of throwing the exception and failing?

@peterlupu I think that's a good approach, but I feel like it wouldn't make sense to apply it to just this one exception. Anything that is considered recoverable should be tried N times before finally bubbling the exception up to the client code.

Would love to get some feedback on which cases those are.

@deleugpn
Copy link
Contributor

deleugpn commented Oct 7, 2023

Reading the linked issue, it sounds to me this is a Worker problem and not a connection problem. A Queue Worker is an infinite loop process and as such it makes itself resilient by having an eager try/catch that prevents it from dying. But some situations are good reason to cause the infinite loop to die.

i.e. suppose your Queue Worker is writing some logs to local storage. If your disk reaches 100%, it's pointless to keep trying to work anything as everything will always be in a failing state. This particular issue is better fixed in a way that the infinite loop of the Worker is able to detect that there was a read-only connection. Then the worker can decide whether it wants to try reconnecting and/or simply die and let the outside orchestration (ECS, Kubernetes, Supervisord, etc) spin up a fresh worker.

@driesvints
Copy link
Member

@cosmastech are you still working on this?

@cosmastech
Copy link
Contributor Author

@cosmastech are you still working on this?

@driesvints to be honest, from the feedback I got, I didn't know if there was a particular change that needs to be made. I would rather have it opened back up for review from @taylorotwell. He seemed concerned in his comment that I had not tested it at all (which I had), but some other concerns were brought up.

@cosmastech cosmastech marked this pull request as ready for review November 6, 2023 11:19
@taylorotwell
Copy link
Member

This shows zero files changed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Database queue worker doesn't quit on internal SQL error
6 participants