Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a default timeout for Connector.Connect #235

Open
manadart opened this issue Mar 20, 2023 · 20 comments
Open

Add a default timeout for Connector.Connect #235

manadart opened this issue Mar 20, 2023 · 20 comments
Labels
Bug Confirmed to be a bug Incomplete Waiting on more information from reporter

Comments

@manadart
Copy link

Juju has occasion to change the bind address of the initial Dqlite node upon entering HA.

When this happens, existing connections keep trying to connect, with our logs repeated output like this:

2023-03-09 08:57:14 WARNING juju.worker.dbaccessor connector.go:77 attempt 2507: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:14 WARNING juju.worker.dbaccessor connector.go:77 attempt 2507: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:15 WARNING juju.worker.dbaccessor connector.go:77 attempt 2508: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:15 WARNING juju.worker.dbaccessor connector.go:77 attempt 2508: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused

An existing call to Exec is locked up at this point:

#       0x46a8d4        time.Sleep+0x134                                                                /u
sr/local/go/src/runtime/time.go:195
#       0x1c14f28       github.com/canonical/go-dqlite/internal/protocol.makeRetryStrategies.func1+0x48 /h
ome/joseph/go/pkg/mod/github.com/canonical/[email protected]/internal/protocol/connector.go:321
#       0x1c12799       github.com/Rican7/retry.shouldAttempt+0x39                                      /h
ome/joseph/go/pkg/mod/github.com/!rican7/[email protected]/retry.go:32
#       0x1c126f6       github.com/Rican7/retry.Retry+0x96                                              /h
ome/joseph/go/pkg/mod/github.com/!rican7/[email protected]/retry.go:19
#       0x1c131c4       github.com/canonical/go-dqlite/internal/protocol.(*Connector).Connect+0xa4      /h
ome/joseph/go/pkg/mod/github.com/canonical/[email protected]/internal/protocol/connector.go:74
#       0x1c42628       github.com/canonical/go-dqlite/driver.(*Connector).Connect+0x268                /h
ome/joseph/go/pkg/mod/github.com/canonical/[email protected]/driver/driver.go:261
#       0x1c05d22       database/sql.(*DB).conn+0x762                                                   /u
sr/local/go/src/database/sql/sql.go:1393
#       0x1c0ce52       database/sql.(*Stmt).connStmt+0x252                                             /u
sr/local/go/src/database/sql/sql.go:2724
#       0x1c0c364       database/sql.(*Stmt).ExecContext+0x184                                          /u
sr/local/go/src/database/sql/sql.go:2628
#       0x46e7204       database/sql.(*Stmt).Exec+0x44                                                  /u
sr/local/go/src/database/sql/sql.go:2651

This is problematic given Juju's architecture, because our worker loops can be blocked from actually shutting down.

Particularly bad is that this happens with db.Ping, which we use for health checks.

@cole-miller
Copy link
Contributor

It looks like the app module doesn't support tweaking the driver parameters that control that retry loop. We can fix that easily enough. You're not seeing the affected connection retry forever, right? With the default parameters it should be capped at 1 second:

driver, err := driver.New(store, driver.WithDialFunc(driverDial), driver.WithLogFunc(o.Log))

if config.BackoffCap == 0 {
config.BackoffCap = time.Second
}

@SimonRichardson
Copy link
Member

It might be prudent to also wrap all the contexts in the driver with a timeout as well. Unfortunately, we can't provide our own contexts in those locations, so a context wrapped in a WithTimeout of 30 seconds or greater. If a commit can't complete within 30 seconds then it's probably never going to succeed and we should probably give up then.

go-dqlite/driver/driver.go

Lines 517 to 525 in fd457b7

func (tx *Tx) Commit() error {
ctx := context.Background()
if _, err := tx.conn.ExecContext(ctx, "COMMIT", nil); err != nil {
return driverError(tx.log, err)
}
return nil
}

I don't mind providing a patch to the driver package for this.

@tomponline
Copy link
Member

tomponline commented Jun 7, 2023

You can do something like this to set a default timeout if not specified by the passed context:

	_, ok := ctx.Deadline()
	if !ok {
		// Set default timeout of 30s if no deadline context provided.
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, time.Duration(30*time.Second))
		defer cancel()
	}

That way if someone does want a longer timeout you're not forcing a particular policy in the driver.

@tomponline
Copy link
Member

tomponline commented Jun 7, 2023

FWIW LXD sets a default timeout on the transaction rather than each query:

https://github.com/lxc/lxd/blob/master/lxd/db/query/transaction.go#L15-L16

Which has been useful while we transition to passing context into each query too.

It also avoids us accidentally keeping transactions open too long and blocking writes to the DB.

@SimonRichardson
Copy link
Member

@tomponline The problem is that dqlite hangs forever when the database is unreachable. If this happens during an Exec or ExecContext then it will block. The problem I'm showing is that this is still a possibility because the driver does a COMMIT (there are many more) with a context that isn't bound to a timeout. So can hang.

You can do something like this to set a default timeout if not specified by the passed context:

Except the context isn't available via the *Tx and the conn does have a context, except it's deprecated. So we shouldn't attempt to use it.

FWIW LXD sets a default timeout on the transaction rather than each query:

We do the same thing in Juju.

https://github.com/juju/juju/blob/09db847e363a757089dcb569158d7ea0dc3a3a7a/database/txn/transaction.go#L172-L173

@tomponline
Copy link
Member

Does cancelling the transaction not end the query too. I've not observed LXD getting stuck on Execs when the transaction is cancelled.

@tomponline
Copy link
Member

tomponline commented Jun 7, 2023

Except the context isn't available via the *Tx and the conn does have a context, except it's deprecated. So we shouldn't attempt to use it.

Agreed. I suppose if using Exec and not ExecContext one would not expect there to be a default timeout in the driver and indefinite blocking is a possibility (in any driver, not just dqlite).

The reason I added the 10s timeout to BeginTx in LXD was I found that it did cause tx.Exec() to stop if dqlite was blocking for some reason. Because we did see some indefinite blocking before that and dont see them anymore.

@tomponline
Copy link
Member

tomponline commented Jun 7, 2023

Maybe the non tx related functions, such as plain Exec Ping etc. could have a default timeout added, and functions that accept a context like ExecContext, BeginTx etc could add one if the context supplied doesn't have a deadline?

That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.

@SimonRichardson
Copy link
Member

So if the context has a deadline attached then the following is enacted. So for instances that don't supply a context with a deadline, we could set it there also.

if deadline, ok := ctx.Deadline(); ok {
p.conn.SetDeadline(deadline)
budget = time.Until(deadline)
defer p.conn.SetDeadline(time.Time{})
}

@SimonRichardson
Copy link
Member

That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.

Sounds like a plan to me.

@freeekanayaka
Copy link
Contributor

So if the context has a deadline attached then the following is enacted. So for instances that don't supply a context with a deadline, we could set it there also.

if deadline, ok := ctx.Deadline(); ok {
p.conn.SetDeadline(deadline)
budget = time.Until(deadline)
defer p.conn.SetDeadline(time.Time{})
}

Am I correct that the only place where a user can't provide a context is tx.Commit() ?

Not sure why the go/sql package has chosen to not support contexts for tx.Commit() (and I don't see any newly introduced extension that changes that), perhaps because that is kind of dangerous and leads to undefined behavior in general (has the tx succeeded or not?). In any case a COMMIT query in dqlite can't hang forever, because at some point (relatively quickly with default values, a few seconds) the leader will step down and stop trying to commit if it can't reach a majority.

@SimonRichardson are you positively seeing a tx.Commit() call that hangs?

The logs attached in the very first message of this issue seem a retry loop for getting a driver.Conn, which I would say is caused by Connector.Connect() not being passed a ctx with a deadline. I didn't check Go's code, but I'd expect the ctx passed to Connector.Connect() to be the same as whatever the user has passed to tx.ExecContext and friends.

If tx.Exec() is being used instead of tx.ExecContext() that might explaining it, and we may want to add a default timeout to our Connector.Connect() implementation, as also suggested by the Go documentation for Connector.Connect:

        // The provided context.Context is for dialing purposes only
	// (see net.DialContext) and should not be stored or used for
	// other purposes. A default timeout should still be used
	// when dialing as a connection pool may call Connect
	// asynchronously to any query.

I'm a not totally convinced that a default timeout should be added to protocol.Call() though, as proposed here, since as said tx.Commit() should not hang, and everything else can be controlled by the user, unless I'm missing something.

@freeekanayaka
Copy link
Contributor

Maybe the non tx related functions, such as plain Exec Ping etc. could have a default timeout added, and functions that accept a context like ExecContext, BeginTx etc could add one if the context supplied doesn't have a deadline?

That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.

As mentioned, as far as I understand it the only place where we would need to add a default timeout to achieve the behavior above should be protocol.Connector.Connect().

@SimonRichardson
Copy link
Member

are you positively seeing a tx.Commit() call that hangs?

Not 100% positive yet.

@freeekanayaka
Copy link
Contributor

FWIW I kind of see why you wouldn't want tx.Commit() to get a context parameter. Essentially it seems that what go/sql supports is effectively sugar around transaction rollback (you just cancel the context you pass to the various tx methods), which is a normal database operation. On the other hand, once a COMMIT has been issued there's usually no way to rollback, and it either succeeds or fail.

If Juju's main use case for cancellation/rollback is graceful shutdown (e.g. of a worker), then I'd say that a top-level context that you can cancel would be the way to go (as opposed to timeouts), and I presume that's already the case in Juju's code. Assuming that's true, as long as all juju's database calls get passed a context linked to that top-level one, then all transactions should already immediately abort/rollback as soon as the top-level context is cancelled (except for tx.Commit() which will finish in at most a few seconds, possibly returning an error due to cluster unavailability). Relying on timeouts for graceful shutdown would actually be sub-optimal. If juju's database calls are not already all making use of context.Context, it'd be nice to change that, in order to support a more streamlined graceful shutdown sequence.

@SimonRichardson
Copy link
Member

and I presume that's already the case in Juju's code

This is correct.

If juju's database calls are not already all making use of context.Context, it'd be nice to change that

We're using context.

@freeekanayaka
Copy link
Contributor

and I presume that's already the case in Juju's code

This is correct.

If juju's database calls are not already all making use of context.Context, it'd be nice to change that

We're using context.

Ok, so assuming that the only hang observed is the one originally pasted at the very beginning of this issue (the connect loop of protocol.Connector.Connect()), we'd need to investigate why the top-level Juju's context cancellation is not propagated down to that loop.

@SimonRichardson
Copy link
Member

Let me get back to you on this. I've not been able replicate it.

@freeekanayaka
Copy link
Contributor

First thing to check on the go-dqlite site would that cancelling a context actually indirectly interrupts that loop. Roughly something like:

  • Run some successful queries against an in-memory cluster of dqlite.Node
  • Stop the dqlite cluster calling dqlite.Node.Close() against all nodes, but leave the db object open
  • Span a goroutine that calls db.ExecContext("...", ctx)
  • Cancel the ctx and check that it stops the connect loop that the go/sql code should have triggered by calling protocol.Connector.Connect() in order to get a working connection.
  • The goroutine should terminate immediately

@MathieuBordere MathieuBordere added Bug Confirmed to be a bug Incomplete Waiting on more information from reporter labels Jun 12, 2023
@MathieuBordere
Copy link

Sorry, I'm just now starting to look at this, for my understanding

#       0x46e7204       database/sql.(*Stmt).Exec+0x44                                                  /u
sr/local/go/src/database/sql/sql.go:2651

Isn't it expected that you hang forever if you don't use ExecContext? Can't you replace existing Exec calls with ExecContext? Same with Ping, can't you use PingContext?

I agree though that it's prudent to add a default context with a sane timeout to protocol.Connector.Connect.

@manadart
Copy link
Author

Yes, we are using <thing>Context exclusively now.

I leave it to discretion of the Dqlite team as to the specific mitigation, but I feel this behaviour would violate the assumptions of consumers.

@cole-miller cole-miller self-assigned this Sep 18, 2023
@cole-miller cole-miller changed the title sql.DB instances hang forever when database is unreachable Add a default timeout for Connector.Connect Sep 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Confirmed to be a bug Incomplete Waiting on more information from reporter
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants