
Draining SQL connections, as is done during CRDB graceful updates, can lead to short blips in availability when using common conn pool setups, due to the closing of SQL connections by CRDB #67071

Closed
joshimhoff opened this issue Jun 30, 2021 · 15 comments
Labels
A-cc-enablement Pertains to current CC production issues or short-term projects C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sre For issues SRE opened or otherwise cares about tracking. sync-me-8 T-server-and-security DB Server & Security

Comments

@joshimhoff
Collaborator

joshimhoff commented Jun 30, 2021

Describe the problem

As discovered by @rafiss & @otan, draining SQL connections, as is done during CRDB graceful updates, leads to short blips in availability when using common conn pool setups, due to the closing of SQL connections by CRDB.

Our understanding of the mechanics is as follows:

  • There is a drain_wait period (default 0s), where connections can no longer be established but existing connections motor on. /healthz will return unavailable, so LBs can mark the node as unavailable.
  • Then there is query_wait (default 10s), where CRDB attempts to close connections that go from non-idle to idle during query execution.
  • After query_wait, CRDB closes open SQL connections - idle or not.
  • There is no way to communicate to the client over the postgres wire protocol that it should stop using the soon-to-be-closed connection; the protocol doesn't provide a means of doing this that connection pool implementations understand (it wasn't built for a distributed setting).
  • Though it appears to be possible to check if a TCP connection is closed without sending / receiving over the network (https://stackoverflow.com/questions/12741386/how-to-know-tcp-connection-is-closed-in-net-package; see the sketch after this list), some connection pool implementations (at least hikari) don't do this.
  • As a result, eventually the client reads from the closed conn, hitting I/O timeout errors, or the error we return right before closing the SQL connection:

    // If we're draining, let the client know by piling on an AdminShutdownError
    // and flushing the buffer.
    if draining() {
    	// TODO(andrei): I think sending this extra error to the client if we also
    	// sent another error for the last query (like a context canceled) is a bad
    	// idea; see #22630. I think we should find a way to return the
    	// AdminShutdown error as the only result of the query.
    	_ /* err */ = writeErr(ctx, &sqlServer.GetExecutorConfig().Settings.SV,
    		newAdminShutdownErr(ErrDrainingExistingConn), &c.msgBuilder, &c.writerState.buf)
    	_ /* n */, _ /* err */ = c.writerState.buf.WriteTo(c.conn)
    }
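
For reference, here is a minimal Go sketch (illustrative only, not from the linked doc or from any pool implementation) of the kind of check mentioned in the StackOverflow link: detecting that the server has closed a TCP connection without a network round trip. It assumes the connection is idle when the check runs.

package conncheck

import (
	"io"
	"net"
	"time"
)

// connClosedByPeer reports whether the peer appears to have closed conn.
// It does a one-byte read with an already-expired deadline: io.EOF means
// the server closed the connection; a timeout error means the connection
// is simply idle. Caveat: only safe to call while the connection is idle,
// because if the server flushed bytes before closing (e.g. an
// AdminShutdown error), this probe consumes one of them.
func connClosedByPeer(conn net.Conn) bool {
	_ = conn.SetReadDeadline(time.Now()) // make the read return immediately
	var buf [1]byte
	_, err := conn.Read(buf[:])
	_ = conn.SetReadDeadline(time.Time{}) // clear the deadline again
	return err == io.EOF
}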

Note there are some mitigating factors, which can sort of be understood as a way of communicating to the client that a conn is not to be used, but there are races, so some availability blip can still happen.

In summary, this failure mode is probably present in all or almost all conn pool setups, but the magnitude of the unavailability blip probably depends on conn pool details & workload details.

Without additional changes to CRDB (which are discussed in the below doc):

  1. To definitely avoid the blip: one can set a connection pool max lifetime less than drain_wait + an estimate of max txn time + an estimate of how long it takes to remove nodes from the LB's list of healthy nodes, so that by the time drain_wait seconds pass, the client has itself closed all connections to the draining node (see the sketch after this list). This means a mix of longer drain times & shorter connection lifetimes; the latter means higher CPU usage, esp. if you do password authentication; the former has disadvantages including greater time spent in a mixed-version state.
  2. One can accept the small blip in availability; if our understanding is correct, it should be relatively small.
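
As a rough sketch of option 1 (illustrative only; the openPool helper and the 20s value are made up, and the real value has to be derived from drain_wait, txn times, and LB behavior as described above), the client-side half with Go's database/sql looks like:

package pooling

import (
	"database/sql"
	"time"

	_ "github.com/jackc/pgx/v4/stdlib" // pgx as the database/sql driver
)

// openPool opens a pool whose connections are recycled frequently enough
// that, together with a long enough server-side drain_wait, connections to
// a draining node are expected to be closed by the client before the node
// starts force-closing them.
func openPool(url string) (*sql.DB, error) {
	db, err := sql.Open("pgx", url)
	if err != nil {
		return nil, err
	}
	db.SetConnMaxLifetime(20 * time.Second) // illustrative number only
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(10)
	return db, nil
}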

There is a longer form doc about this from @otan and @rafiss: https://docs.google.com/document/d/1oOQ4pnTxEdUPA69qlK5NLCk0C-RldIJeXdlcY1HE3nY/edit#

To Reproduce

From @rafiss re: hikari: https://docs.google.com/document/d/1b7ENgM36xYepaQ-PPG5PESl133m7YcPPatayRGGuAkM/edit

Expected behavior
CRDB markets itself as having a zero-downtime upgrade process, & this ticket points out that to achieve this goal, connection pool settings + CRDB settings need to be chosen carefully, and in a way that has definite downsides.

Additional context
This is affecting a CC production customer.

gz#8424

Epic CRDB-10458

Jira issue: CRDB-8347

@joshimhoff joshimhoff added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sre For issues SRE opened or otherwise cares about tracking. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) T-server-and-security DB Server & Security labels Jun 30, 2021
@joshimhoff joshimhoff changed the title Draining SQL connections, as is done during CRDB graceful updates, leads to short blips in availability when using common conn pool setups, due to the closing of SQL connections by CRDB. Draining SQL connections, as is done during CRDB graceful updates, leads to short blips in availability when using common conn pool setups, due to the closing of SQL connections by CRDB Jun 30, 2021
@joshimhoff
Collaborator Author

joshimhoff commented Jun 30, 2021

I have various Qs about this still:

Though it appears to be possible to check if a TCP connection is closed without sending / receiving over the network (https://stackoverflow.com/questions/12741386/how-to-know-tcp-connection-is-closed-in-net-package), some connection pool implementations (possibly/probably (?) many/all but at least hikari) don't do this.

Do any connection pool implementations avoid this failure mode by checking for a closed connection (which IIUC doesn't require sending / receiving over the network so could be fast) & retrying without bubbling up an error to the user if one is found? That seems possible to me still & could explain why we don't see evidence of this happening widely, despite many customers running in production.

Post hitting this error, connection pool implementations stop using the closed connection, but the above issues do bubble up to the application layer, so there is some impact on availability.

I don't really understand what hikari & other connection pool implementations do here. When they hit I/O timeout errors specifically, they then check the health of the connection & if that second check fails, then they recycle the connection? In a way this ties into my first Q. I wouldn't mind links to the code to answer this second Q.

Perhaps @rafiss has thoughts (tho no rush Rafi... I know you are busy).

@rafiss
Collaborator

rafiss commented Jun 30, 2021

Do any connection pool implementations avoid this failure mode by checking for a closed connection (which IIUC doesn't require sending / receiving over the network so could be fast) & retrying without bubbling up an error to the user if one is found?

One thing missing from the issue summary here is that connection pools already check for the health of a connection before using it. Specifically, here is the order of events when the errors happen.

  • Application tries to borrow a connection from its connection pool.
  • Connection pool picks a connection and checks that the connection is healthy.
  • This connection is given to the application.
  • CRDB node begins shutdown process, and it closes the connection
  • Application tries to use the connection that was previously determined to be healthy, but the connection is closed now.

Here is the getConnection code in Hikari. Note that it calls isConnectionAlive. And here you can see the isConnectionAlive implementation.

After the connection pool gives the connection to the application, then the connection pool can no longer perform the check you are proposing. The application itself could perform the check each time it tries to use the connection (a connection which was already deemed "healthy"), but that would be incredibly onerous in the code.

But that brings me to the next point -- how would doing the check even help? The client will end up in the same situation:

  • (currently): it tries to use the connection, that fails, and the error is bubbled up
  • (your proposal): it checks if the TCP connection is working; if it's not... then what? this would be another error that would need to be bubbled up.

The important point is that there's no general retry logic that can be used. If the application is in the middle of a transaction when it gets the error, then the connection pool wouldn't know that it has to retry all the previous statements from that transaction. Or if the application had already finished one transaction, but got the error when doing a second, the connection pool wouldn't know what's already succeeded and what hasn't.
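
To make the point concrete, here is a minimal sketch (illustrative only; the runTxWithRetry helper is made up) of what transaction-level retry looks like when it lives in the application, which is the only layer that has the whole transaction in hand:

package txretry

import (
	"context"
	"database/sql"
)

// runTxWithRetry shows why retry has to live at the application layer: only
// the caller has the full transaction (the fn closure), so only the caller
// can safely re-run all of its statements on a fresh connection after the
// old one was closed mid-transaction.
func runTxWithRetry(ctx context.Context, db *sql.DB, attempts int, fn func(*sql.Tx) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = func() error {
			tx, beginErr := db.BeginTx(ctx, nil)
			if beginErr != nil {
				return beginErr
			}
			if fnErr := fn(tx); fnErr != nil {
				_ = tx.Rollback()
				return fnErr
			}
			return tx.Commit()
		}()
		if err == nil {
			return nil
		}
		// A real implementation would only retry errors known to be safe
		// (e.g. the connection died before any statement ran); this sketch
		// retries everything for brevity.
	}
	return err
}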


I don't really understand what hikari & other connection pool implementations do here. When they hit I/O timeout errors specifically, they then check the health of the connection & if that second check fails, then they recycle the connection? In a way this ties into my first Q. I wouldn't mind links to the code to answer this second Q.

The pool always checks the health of a connection when it gives it to the application. Specifically in Hikari, this block will return false for isConnectionAlive. Also, more to your question, if an error happens while the connection is in use, then the logic of Hikari's ProxyConnection will evict that connection from the pool.


It would be helpful to understand what we want out of this issue beyond what is described in #66319

@joshimhoff
Collaborator Author

joshimhoff commented Jun 30, 2021

It would be helpful to understand what we want out of this issue beyond what is described in #66319

I think we want a shared mental model of the problem, rather than a feature request that may improve the situation.

If the application is in the middle of a transaction when it gets the error, then the connection pool wouldn't know that it has to retry all the previous statements from that transaction.

The fact that the need to close a conn in the middle of a txn leads to an unavailability blip makes sense to me. Customers need to avoid overly long-running queries if they want zero downtime. We can make that recommendation; that is, txns must finish within query_wait seconds (or something like that).

I am thinking only about idle conns. The closing of an idle conn can currently lead to an availability blip, and I don't understand why this is necessary.

Or if the application had already finished one transaction, but got the error when doing a second, the connection pool wouldn't know what's already succeeded and what hasn't.

Perhaps this is the answer to my Q but I don't understand it.

This connection is given to the application.

Or this? Is a specific connection always literally returned to the application in hikari land? Let's look at https://golang.org/pkg/database/sql/#DB. You can ask for a specific conn via https://golang.org/pkg/database/sql/#DB.Conn but why would you do that? That opens you up to this failure mode & also to failures when nodes die randomly. But if you query always via *DB (e.g. https://golang.org/src/database/sql/sql.go?s=42508:42577#L1538), then you are not committing to sending over a specific conn. In this case, if someone begins executing a SQL query and that query immediately fails in a way that indicates the connection has been closed, I don't see why database/sql can't retry without bubbling the error to the application layer, simply by choosing another conn. Do you see what I mean? Or am I getting confused?
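
For illustration (a sketch only; the table name and helper functions are made up), here is the contrast being described: querying through *sql.DB, where database/sql can pick another pooled connection and retry driver.ErrBadConn transparently, versus pinning a single connection via DB.Conn:

package conns

import (
	"context"
	"database/sql"
)

// queryViaPool goes through *sql.DB: database/sql picks a pooled connection
// for each call and, if the driver reports driver.ErrBadConn, retries on
// another connection before the error ever reaches the application.
func queryViaPool(ctx context.Context, db *sql.DB) (int, error) {
	var n int
	err := db.QueryRowContext(ctx, "SELECT count(*) FROM t").Scan(&n)
	return n, err
}

// queryViaPinnedConn pins one physical connection. If the server closes that
// connection (e.g. during drain), every subsequent use of it fails; there is
// no other connection for database/sql to fall back to.
func queryViaPinnedConn(ctx context.Context, db *sql.DB) (int, error) {
	conn, err := db.Conn(ctx)
	if err != nil {
		return 0, err
	}
	defer conn.Close() // return the connection to the pool promptly

	var n int
	err = conn.QueryRowContext(ctx, "SELECT count(*) FROM t").Scan(&n)
	return n, err
}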

@joshimhoff
Collaborator Author

joshimhoff commented Jun 30, 2021

Trying to understand whether this issue could hit a user of golang database/sql & why. This is interesting:

// ExecContext executes a query without returning any rows.
// The args are for any placeholder parameters in the query.
func (db *DB) ExecContext(ctx context.Context, query string, args ...interface{}) (Result, error) {
	var res Result
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		res, err = db.exec(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.exec(ctx, query, args, alwaysNewConn)
	}
	return res, err
}

Drilling deeper:

func (db *DB) exec(ctx context.Context, query string, args []interface{}, strategy connReuseStrategy) (Result, error) {
	dc, err := db.conn(ctx, strategy)
	if err != nil {
		return nil, err
	}
	return db.execDC(ctx, dc, dc.releaseConn, query, args)
}

Also this:

// maxBadConnRetries is the number of maximum retries if the driver returns
// driver.ErrBadConn to signal a broken connection before forcing a new
// connection to be opened.
const maxBadConnRetries = 2

There is a retry loop for getting a usable connection.

From https://golang.org/src/database/sql/sql.go.

What does driver.ErrBadConn mean exactly?

// conn returns a newly-opened or cached *driverConn.
func (db *DB) conn(ctx context.Context, strategy connReuseStrategy) (*driverConn, error) {
	db.mu.Lock()
	if db.closed {
		db.mu.Unlock()
		return nil, errDBClosed
	}
	// Check if the context is expired.
	select {
	default:
	case <-ctx.Done():
		db.mu.Unlock()
		return nil, ctx.Err()
	}
	lifetime := db.maxLifetime

	// Prefer a free connection, if possible.
	numFree := len(db.freeConn)
	if strategy == cachedOrNewConn && numFree > 0 {
		conn := db.freeConn[0]
		copy(db.freeConn, db.freeConn[1:])
		db.freeConn = db.freeConn[:numFree-1]
		conn.inUse = true
		if conn.expired(lifetime) {
			db.maxLifetimeClosed++
			db.mu.Unlock()
			conn.Close()
			return nil, driver.ErrBadConn
		}
		db.mu.Unlock()

		// Reset the session if required.
		if err := conn.resetSession(ctx); err == driver.ErrBadConn {
			conn.Close()
			return nil, driver.ErrBadConn
		}

		return conn, nil
	}

	// Out of free connections or we were asked not to use one. If we're not
	// allowed to open any more connections, make a request and wait.
	if db.maxOpen > 0 && db.numOpen >= db.maxOpen {
		// Make the connRequest channel. It's buffered so that the
		// connectionOpener doesn't block while waiting for the req to be read.
		req := make(chan connRequest, 1)
		reqKey := db.nextRequestKeyLocked()
		db.connRequests[reqKey] = req
		db.waitCount++
		db.mu.Unlock()

		waitStart := nowFunc()

		// Timeout the connection request with the context.
		select {
		case <-ctx.Done():
			// Remove the connection request and ensure no value has been sent
			// on it after removing.
			db.mu.Lock()
			delete(db.connRequests, reqKey)
			db.mu.Unlock()

			atomic.AddInt64(&db.waitDuration, int64(time.Since(waitStart)))

			select {
			default:
			case ret, ok := <-req:
				if ok && ret.conn != nil {
					db.putConn(ret.conn, ret.err, false)
				}
			}
			return nil, ctx.Err()
		case ret, ok := <-req:
			atomic.AddInt64(&db.waitDuration, int64(time.Since(waitStart)))

			if !ok {
				return nil, errDBClosed
			}
			// Only check if the connection is expired if the strategy is cachedOrNewConns.
			// If we require a new connection, just re-use the connection without looking
			// at the expiry time. If it is expired, it will be checked when it is placed
			// back into the connection pool.
			// This prioritizes giving a valid connection to a client over the exact connection
			// lifetime, which could expire exactly after this point anyway.
			if strategy == cachedOrNewConn && ret.err == nil && ret.conn.expired(lifetime) {
				db.mu.Lock()
				db.maxLifetimeClosed++
				db.mu.Unlock()
				ret.conn.Close()
				return nil, driver.ErrBadConn
			}
			if ret.conn == nil {
				return nil, ret.err
			}

			// Reset the session if required.
			if err := ret.conn.resetSession(ctx); err == driver.ErrBadConn {
				ret.conn.Close()
				return nil, driver.ErrBadConn
			}
			return ret.conn, ret.err
		}
	}

	db.numOpen++ // optimistically
	db.mu.Unlock()
	ci, err := db.connector.Connect(ctx)
	if err != nil {
		db.mu.Lock()
		db.numOpen-- // correct for earlier optimism
		db.maybeOpenNewConnections()
		db.mu.Unlock()
		return nil, err
	}
	db.mu.Lock()
	dc := &driverConn{
		db:         db,
		createdAt:  nowFunc(),
		returnedAt: nowFunc(),
		ci:         ci,
		inUse:      true,
	}
	db.addDepLocked(dc, dc)
	db.mu.Unlock()
	return dc, nil
}

Focusing on the "use an existing free conn" bit:

	// Prefer a free connection, if possible.
	numFree := len(db.freeConn)
	if strategy == cachedOrNewConn && numFree > 0 {
		conn := db.freeConn[0]
		copy(db.freeConn, db.freeConn[1:])
		db.freeConn = db.freeConn[:numFree-1]
		conn.inUse = true
		if conn.expired(lifetime) {
			db.maxLifetimeClosed++
			db.mu.Unlock()
			conn.Close()
			return nil, driver.ErrBadConn
		}
		db.mu.Unlock()

		// Reset the session if required.
		if err := conn.resetSession(ctx); err == driver.ErrBadConn {
			conn.Close()
			return nil, driver.ErrBadConn
		}

		return conn, nil
	}

Return driver.ErrBadConn if the conn is too old or if "resetting the session" returns driver.ErrBadConn. Maybe calling resetSession fails with driver.ErrBadConn in case CRDB has closed the conn on the server side already? In which case the only way to hit this issue when using *DB's ExecContext is a race where resetSession runs right before CRDB closes all conns.

Looks about right:

// SessionResetter may be implemented by Conn to allow drivers to reset the
// session state associated with the connection and to signal a bad connection.
type SessionResetter interface {
	// ResetSession is called prior to executing a query on the connection
	// if the connection has been used before. If the driver returns ErrBadConn
	// the connection is discarded.
	ResetSession(ctx context.Context) error
}

Specifically this part in the docs: "... to signal a bad connection".

From https://golang.org/src/database/sql/driver/driver.go

In which case the only way to hit this issue when using *DB's ExecContext is a race where resetSession runs right before CRDB closes all conns.

So a user of database/sql can be affected. But perhaps not very likely?

OTOH, if you call Conn (https://golang.org/pkg/database/sql/#DB.Conn) & hold onto it for some time, you increase the likelihood of being affected by the above bug quite a lot, as there is more time for the conn to be closed between the resetSession call and the use of the conn. Minimizing how long you hold onto an explicit connection -- or ideally avoiding it altogether (use *DB instead) -- seems very wise to me. Possibly some of our customers don't do that.

Not sure I have this right but I am learning things at least...

@joshimhoff
Collaborator Author

joshimhoff commented Jun 30, 2021

Also, @rafiss, after reading database/sql, I see what you mean when you say:

One thing missing from the issue summary here is that connection pools already check for the health of a connection before using it.

I'll update the issue summary.

Wondering if you have any thoughts about the database/sql analysis above. In particular this bit as I could see it contributing to the customer issue we are looking at:

OTOH, if you call Conn (https://golang.org/pkg/database/sql/#DB.Conn) & hold onto it for some time, you increase the likelihood of being affected by the above bug quite a lot, as there is more time for the conn to be closed between the resetSession call and the use of the conn. Minimizing how long you hold onto an explicit connection -- or ideally avoiding it altogether (use *DB instead) -- seems very wise to me. Possibly some of our customers don't do that.

@joshimhoff joshimhoff changed the title Draining SQL connections, as is done during CRDB graceful updates, leads to short blips in availability when using common conn pool setups, due to the closing of SQL connections by CRDB Draining SQL connections, as is done during CRDB graceful updates, can lead to short blips in availability when using common conn pool setups, due to the closing of SQL connections by CRDB Jun 30, 2021
@joshimhoff
Collaborator Author

Now I am looking at the pgx driver for database/sql and I found something very interesting!! https://github.com/jackc/pgx

Look at this beautiful thing:

func (c *Conn) ExecContext(ctx context.Context, query string, argsV []driver.NamedValue) (driver.Result, error) {
	if c.conn.IsClosed() {
		return nil, driver.ErrBadConn
	}

	args := namedValueToInterface(argsV)

	commandTag, err := c.conn.Exec(ctx, query, args...)
	// if we got a network error before we had a chance to send the query, retry
	if err != nil {
		if pgconn.SafeToRetry(err) {
			return nil, driver.ErrBadConn
		}
	}
	return driver.RowsAffected(commandTag.RowsAffected()), err
}

Combine above with below from database/sql:

// ExecContext executes a query without returning any rows.
// The args are for any placeholder parameters in the query.
func (db *DB) ExecContext(ctx context.Context, query string, args ...interface{}) (Result, error) {
	var res Result
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		res, err = db.exec(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.exec(ctx, query, args, alwaysNewConn)
	}
	return res, err
}

I think if you use database/sql, pgx as a driver, and query via DB.ExecContext (query against *DB & don't open a txn), you will NOT ever hit the bug described in this ticket. If a conn is closed by CRDB, pgx will return driver.ErrBadConn, leading to a retry. This is what I was hoping for when I wrote:

Do any connection pool implementations avoid this failure mode by checking for a closed connection (which IIUC doesn't require sending / receiving over the network so could be fast) & retrying without bubbling up an error to the user if one is found?

(Maybe there is a race if the query happens right as the conn is being closed, meaning the conn is not fully closed yet, tho. Sounds like it, as per "if we got a network error before we had a chance to send the query".)

Interestingly, I think you CAN hit the failure mode when trying to open a txn, as driver.ErrBadConn is not returned (except when you already know the conn is closed without sending over the network, that is, when the client closed the conn, not the server):

func (c *Conn) BeginTx(ctx context.Context, opts driver.TxOptions) (driver.Tx, error) {
	if c.conn.IsClosed() {
		return nil, driver.ErrBadConn
	}

	if pconn, ok := ctx.Value(ctxKeyFakeTx).(**pgx.Conn); ok {
		*pconn = c.conn
		return fakeTx{}, nil
	}

	var pgxOpts pgx.TxOptions
	switch sql.IsolationLevel(opts.Isolation) {
	case sql.LevelDefault:
	case sql.LevelReadUncommitted:
		pgxOpts.IsoLevel = pgx.ReadUncommitted
	case sql.LevelReadCommitted:
		pgxOpts.IsoLevel = pgx.ReadCommitted
	case sql.LevelRepeatableRead, sql.LevelSnapshot:
		pgxOpts.IsoLevel = pgx.RepeatableRead
	case sql.LevelSerializable:
		pgxOpts.IsoLevel = pgx.Serializable
	default:
		return nil, fmt.Errorf("unsupported isolation: %v", opts.Isolation)
	}

	if opts.ReadOnly {
		pgxOpts.AccessMode = pgx.ReadOnly
	}

	tx, err := c.conn.BeginTx(ctx, pgxOpts)
	if err != nil {
		return nil, err
	}

	return wrapTx{ctx: ctx, tx: tx}, nil
}

@joshimhoff
Collaborator Author

I updated the issue summary as per feedback from Rafi!

@jordanlewis
Member

A very long term idea: we could think about implementing live TCP migrations. There are systems that allow you to move a TCP connection from one node to another, with no interruption to the other end.

https://twitter.com/colmmacc/status/1407820679962054662?s=20

https://ipsj.ixsq.nii.ac.jp/ej/?action=repository_action_common_download&item_id=68424&item_no=1&attribute_id=1&file_no=1

@joshimhoff
Collaborator Author

joshimhoff commented Jul 15, 2021

Sort of relevant database/sql issue posted by @jeffswenson: golang/go#11978

(Maybe there is a race if query happens right as conn is being closed meaning conn is not fully closed yet tho. Sounds like it as per "if we got a network error before we had a chance to send the query".)

In other words, yes, there is a race.

@knz
Contributor

knz commented Jul 29, 2021

@rafiss @joshimhoff for my own education, can you tell me how this issue and #66319 differ from each other?

@knz knz added the A-cc-enablement Pertains to current CC production issues or short-term projects label Jul 29, 2021
@joshimhoff
Collaborator Author

I think #66319 proposes a modification to CRDB (see title especially) that may help with the issue outlined in this ticket. If we want to merge things somehow, that SGTM, tho I think it's reasonable not to, given the above. I feel even more strongly that the details in this ticket re: the nature of the problem should be stored somewhere, as it took a long time for me, Rafi, & Oliver to get on the same page about the problem, as it is rather subtle.

@rafiss
Collaborator

rafiss commented Aug 4, 2021

I wrote a toy test program, and found that using the pgx driver with database/sql reproduces the server is shutting down error. I ran the code below against a local 3-node cluster behind HAProxy, then started a graceful shutdown of one of the nodes.

Expand this to see the `pgx` with `database/sql` code
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/jackc/pgx/v4/stdlib"
	_ "github.com/lib/pq"
	"github.com/shettyh/threadpool"
)

func main() {
	url := "postgresql://root@localhost:26257/defaultdb?sslmode=disable"
	db, err := sql.Open("pgx", url)
	if err != nil {
		panic(err)
	}
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(10)

	pool := threadpool.NewThreadPool(10, 100000)

	var results []*threadpool.Future
	for i := 0; i < 1250; i++ {
		t := &theTask{db}
		res, err := pool.ExecuteFuture(t)
		if err != nil {
			panic(err)
		}
		results = append(results, res)
	}

	for i, resF := range results {
		res := resF.Get().(bool)
		activeConns := db.Stats().InUse
		totalConns := db.Stats().OpenConnections
		waitDur := db.Stats().WaitDuration
		maxLifetimeClosed := db.Stats().MaxLifetimeClosed
		fmt.Printf("%t %d activeConns=%d totalConns=%d waitDur=%d maxLifetimeClosed=%d\n",
			res, i, activeConns, totalConns, waitDur.Milliseconds(), maxLifetimeClosed)
	}
}

type theTask struct {
	db *sql.DB
}

func (t *theTask) Call() interface{} {
	var b bool
	row := t.db.QueryRow("SELECT pg_sleep(0.5)")
	if err := row.Scan(&b); err != nil {
		fmt.Printf("error! %v\n", err)
		return false
	}
	if !b {
		panic("not possible")
	}
	return b
}

Note the error does not happen when using database/sql with lib/pq. (In the above code replace sql.Open("pgx", url) with sql.Open("postgres", url).) This is because the error handling at https://github.com/lib/pq/blob/9e747ca50601fcb6c958dd89f4cb8aea3e067767/error.go#L481 seems to return ErrBadConn when CRDB is shutting down, which allows the retry logic that @joshimhoff described above to work.

The error does also happen when using pgx with pgxpool (see next snippet for code). So it looks like the pgx detection for "safe to retry" doesn't handle the way CRDB shuts down. My reading of https://github.com/jackc/pgconn/blob/53f5fed36c570f0b5c98d6ec2415658c7b9bd11c/pgconn.go#L949 is that pgx only marks an error as safe to retry if 0 bytes have been written to the socket.

Expand this to see the `pgx` with `pgxpool` code
package main

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v4/pgxpool"
	_ "github.com/jackc/pgx/v4/stdlib"
	_ "github.com/lib/pq"
	"github.com/shettyh/threadpool"
)

func main() {
	ctx := context.Background()
	url := "postgresql://root@localhost:26257/defaultdb?sslmode=disable"
	config, _ := pgxpool.ParseConfig(url)
	config.MaxConns = 10
	config.MinConns = 5
	db, err := pgxpool.ConnectConfig(ctx, config)
	if err != nil {
		panic(err)
	}

	pool := threadpool.NewThreadPool(10, 100000)

	var results []*threadpool.Future
	for i := 0; i < 1250; i++ {
		t := &theTask{db}
		res, err := pool.ExecuteFuture(t)
		if err != nil {
			panic(err)
		}
		results = append(results, res)
	}

	for i, resF := range results {
		res := resF.Get().(bool)
		activeConns := db.Stat().AcquiredConns()
		totalConns := db.Stat().IdleConns() + activeConns
		acquireDur := db.Stat().AcquireDuration()
		emptyAcquireCount := db.Stat().EmptyAcquireCount()
		fmt.Printf("%t %d activeConns=%d totalConns=%d acquireDur=%d emptyAcquireCount=%d\n",
			res, i, activeConns, totalConns, acquireDur.Milliseconds(), emptyAcquireCount)
	}
}

type theTask struct {
	db *pgxpool.Pool
}

func (t *theTask) Call() interface{} {
	var b bool
	row := t.db.QueryRow(context.Background(), "SELECT pg_sleep(0.5)")
	if err := row.Scan(&b); err != nil {
		fmt.Printf("error! %v\n", err)
		return false
	}
	if !b {
		panic("not possible")
	}
	return b
}
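
As a side note (a sketch only, not part of the repro), pgxpool also exposes a max connection lifetime, so the lifetime-based mitigation from the issue summary can be expressed there directly; the 30s value and the newPool helper are illustrative.

package pgxpooling

import (
	"context"
	"time"

	"github.com/jackc/pgx/v4/pgxpool"
)

// newPool builds the same kind of pool as the repro above but recycles
// connections well before a draining node would force-close them;
// MaxConnLifetime has to be chosen against the server-side drain_wait,
// as discussed in the issue summary.
func newPool(ctx context.Context, url string) (*pgxpool.Pool, error) {
	config, err := pgxpool.ParseConfig(url)
	if err != nil {
		return nil, err
	}
	config.MaxConns = 10
	config.MinConns = 5
	config.MaxConnLifetime = 30 * time.Second
	return pgxpool.ConnectConfig(ctx, config)
}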

@joshimhoff
Collaborator Author

NICE REPRO!

@joshimhoff
Collaborator Author

joshimhoff commented Aug 4, 2021

found that using the pgx driver with database/sql reproduces the server is shutting down error

Does it repro both ERROR: server is shutting down AND I/O timeouts post that due to conn closure (as happened with the earlier CC customer)? Otherwise the issue could be more:

But returning ERROR: server is shutting down first means a bit of unavailability ahead of the conn being closed, as database/sql can’t retry around that error. Thoughts, Rafi?

I believe the thinking is that since this is an “admin shutdown” error (see 57P01 on Appendix A. PostgreSQL Error Codes ), the client should be able to reliably retry around this error. But of course that depends on the driver, and I don’t know which drivers are smart about 57P01 and which are not.

That is, perhaps the pgx driver with database/sql doesn't retry around ERROR: server is shutting down but does around trying to send to a closed conn?

We could experiment by moving them to a special build that never returns ERROR: server is shutting down. Or actually we could just use your repro!!
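
For anyone experimenting along these lines, here is a hedged sketch (not a claim about what any driver does today; the isAdminShutdown helper is made up) of recognizing the admin-shutdown SQLSTATE with pgx and treating it as retryable:

package shutdownerr

import (
	"errors"

	"github.com/jackc/pgconn"
)

// isAdminShutdown reports whether err carries SQLSTATE 57P01
// (admin_shutdown), which is what CRDB returns on connections it is about
// to close while draining. A caller that sees this on an otherwise idle
// connection could discard the connection and retry on another one.
func isAdminShutdown(err error) bool {
	var pgErr *pgconn.PgError
	return errors.As(err, &pgErr) && pgErr.Code == "57P01"
}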

@exalate-issue-sync exalate-issue-sync bot removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Oct 6, 2021
@rafiss
Collaborator

rafiss commented Jun 9, 2022

closing as part of Epic CRDB-10458

@rafiss rafiss closed this as completed Jun 9, 2022