-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CBG-3834 Close connected websocket connections on database close #7157
Conversation
@@ -208,7 +208,7 @@ func connect(arc *activeReplicatorCommon, idSuffix string) (blipSender *blip.Sen | |||
arc.replicationStats.NumConnectAttempts.Add(1) | |||
|
|||
var originPatterns []string // no origin headers for ISGR | |||
blipContext, err := NewSGBlipContext(arc.ctx, arc.config.ID+idSuffix, originPatterns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd actually want to close connections for ISGR when the database is closed as well?
If we don't want to do this, I think this needs a comment why we don't want to do this, or maybe even passing a context calling https://pkg.go.dev/context#WithoutCancel to make it clear that we don't want it to be cancelled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, we are not going to pass in a cancellation context for a passive ISGR peer at this time to limit the scope of this PR.
I think it would be safe to add this here, but it's not needed since DatabaseContext.Close
will call DatabaseContext.sgReplicateMgr.Stop
which stops all replicators. If you agree with this, I think adding a comment to this code would be helpful, because I'll ask this question again.
@@ -412,6 +414,10 @@ func NewDatabaseContext(ctx context.Context, dbName string, bucket base.Bucket, | |||
UserFunctionTimeout: defaultUserFunctionTimeout, | |||
} | |||
|
|||
// set up cancellable context based on the background context (context lifecycle should decoupled from the request context) | |||
// | |||
dbContext.CancelContext, dbContext.cancelContextFunc = context.WithCancel(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a reasonable argument here that this should work the same as DatabaseContext.exitChanges
, maybe even in the same function.
There are two pathways here:
DatabaseContext.Close
(this is what is described here), which shuts down the databaseDatabaseContext.TakeDbOffline
(this is not really used much to take a local node offline).
In this case, we'd want to cancel the context while the database is offline, and restart it next time we call StartOnlineProcesses
.
Suggestions:
- Make
NewSGBlipContext
fail if a non nil context is passed in, so the caller has to pass this in with Sync Gateway, and it would error if it were cancelled and we'd be likely to catch this in our tests. - Possibly write a test for local offline.
- Put
exitChanges
andCancelContext
into a common function and explain what the differences are.exitChanges
is legacy code that might no longer be necessary to close a changes feed. I don't think you have to fix it now, but I am going to ask about this when I look again.exitChanges
doesn't work for all cases because we want the blip connections to close if the connection is open but idle. - Use the ctx that is passed in from
NewDatabaseContext
orStartOnlineProcesses
. We are careful to detach this from the request context already, or the background processes would fail:sync_gateway/rest/admin_api.go
Line 616 in 0868626
if err := h.server.ReloadDatabaseWithConfig(contextNoCancel, *updatedDbConfig); err != nil { sync_gateway/rest/admin_api.go
Line 681 in 0868626
err = h.server.ReloadDatabaseWithConfig(contextNoCancel, tmpConfig) WithoutCancel
on the context passed in (so it retains any information from where it is passed in, in case we want it later) but then set up a cancellation function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't understand this code correctly before and I think we can limit the scope of this PR.
There are two pathways the code cares about:
- Exiting the changes feed from a blip handler. This is done by
exitChanges
(for takeDBOffline), or the request context cancellingChangesOptions.ChangesCtx
, or the database callingchangeListener.Stop
which will closechangeListener.terminator
and then the changes feeds are woken up withchangeListener.tapNotifier.Broadcast
which will returnWaiterCheckTerminated
. This has everything to do with closing a goroutine on the server side and nothing to do with closing a connection from a websocket client. The reason that I thought this would close the connection is that this code eventually exits heresync_gateway/db/blip_handler.go
Line 479 in 0868626
forceClose := generateBlipSyncChanges(bh.loggingCtx, changesDb, channelSet, options, opts.docIDs, func(changes []*ChangeEntry) error { - This PR is intended to cover closing the connection on potentially idle clients, and I was definitely confusing the two.
I don't know how to make this more clear, so I'm leaving a lot of text here. I think this isn't important to do now, but DatabaseContext.exitChanges
would be obsolesced by DatabaseContext.CancelContext
if we decided to handle the offline case. I like the idea of leaving behind some type of in code bread crumbs, because I don't think we'll prioritize cleaning this code up. I have spent considerable amount of time looking at this code for a past hanging goroutine in 7b5a401 so I could be suffering from recency bias.
I think a comment as to why you are picking at the start of NewDatabaseContext
rather than in StartOnlineProcesses
would be something that seemed not like the typical pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment for clarity.
// set up cancellable context based on the background context (context lifecycle should decoupled from the request context) | ||
// | ||
dbContext.CancelContext, dbContext.cancelContextFunc = context.WithCancel(context.Background()) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you decide to keep cancellation context in NewDatabaseContext
, you'll need:
cleanupFunctions = append(cleanupFunctions, func() {
dbContext.cancelContextFunc()
})
// TestBlipDatabaseClose verifies that the client connection is closed when the database is closed. | ||
// Starts a continuous pull replication then updates the db to trigger a close. | ||
func TestBlipDatabaseClose(t *testing.T) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a pretty good argument for setting this up with btcRunner := NewBlipTesterClientRunner(t)
which will parameterize using 3.x and 4.x messages. Using btcRunner.NewBlipTesterClientOptsWithRT(rt, opts)
would allow you to not have to worry about implementing handlers, and you can do btcRunner.StartPullSince
.
I think this would be equivalent to the test below but with more common boilerplate.
I think there's a different argument to be made that /db/_blipsync should be called, since requests contexts are at place here, and testing for the actual error status to see that we don't regress. I don't think we yet have good boilerplate for this because while ISGR lets you use a real URL
dbURL, err := url.Parse(server.URL + "/db/_blipsync") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to not set up real _blipsync here, we can be sure it exits with some error code. Here's some more up to date boilerplate for this test:
unc TestBlipDatabaseClose(t *testing.T) {
base.SetUpTestLogging(t, base.LevelInfo, base.KeyHTTP, base.KeySync, base.KeySyncMsg, base.KeyChanges, base.KeyCache)
btcRunner := NewBlipTesterClientRunner(t)
btcRunner.Run(func(t *testing.T, SupportedBLIPProtocols []string) {
rt := NewRestTesterPersistentConfig(t)
defer rt.Close()
const username = "alice"
rt.CreateUser(username, []string{"*"})
btc := btcRunner.NewBlipTesterClientOptsWithRT(rt, &BlipTesterClientOpts{Username: username})
var blipContextClosed atomic.Bool
btcRunner.clients[btc.id].pullReplication.bt.blipContext.OnExitCallback = func() {
blipContextClosed.Store(true)
}
// put a doc, and make sure blip connection is established
markerDoc := "markerDoc"
markerDocVersion := rt.CreateTestDoc(markerDoc)
rt.WaitForPendingChanges()
btcRunner.StartPull(btc.id)
btcRunner.WaitForVersion(btc.id, markerDoc, markerDocVersion)
RequireStatus(t, rt.SendAdminRequest(http.MethodDelete, "/{{.db}}/", ""), http.StatusOK)
require.EventuallyWithT(t, func(c *assert.CollectT) {
assert.True(c, blipContextClosed.Load())
}, time.Second*10, time.Millisecond*100)
})
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test-race failure is from blipContextClosed
not being an atomic.
@@ -1520,7 +1520,7 @@ func createBlipTesterWithSpec(tb testing.TB, spec BlipTesterSpec, rt *RestTester | |||
return nil, err | |||
} | |||
// Make BLIP/Websocket connection | |||
bt.blipContext, err = db.NewSGBlipContextWithProtocols(base.TestCtx(tb), "", origin, protocols) | |||
bt.blipContext, err = db.NewSGBlipContextWithProtocols(base.TestCtx(tb), "", origin, protocols, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should test this in our test code as well.
bt.blipContext, err = db.NewSGBlipContextWithProtocols(base.TestCtx(tb), "", origin, protocols, nil) | |
bt.blipContext, err = db.NewSGBlipContextWithProtocols(base.TestCtx(tb), "", origin, protocols, bt.rt.Database().CancelContext) |
github.com/couchbase/go-blip v0.0.0-20241013214017-67df25c2fb5d h1:bnYkmUky/Hy7m2NaIhkEu2MKpJAULcYm90oku3D+3E4= | ||
github.com/couchbase/go-blip v0.0.0-20241013214017-67df25c2fb5d/go.mod h1:Dz8Keu17/4cjF7hvKYqOjH4pRXOh1CCnzsKlBOJaoJE= | ||
github.com/couchbase/go-blip v0.0.0-20241014142134-cc8d8ebf1949 h1:jwFj/GtyaoACmwnGfan/XW38TBTG1kYboXLZfAqd2VE= | ||
github.com/couchbase/go-blip v0.0.0-20241014142134-cc8d8ebf1949/go.mod h1:Dz8Keu17/4cjF7hvKYqOjH4pRXOh1CCnzsKlBOJaoJE= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
go mod tidy will fix this up
Adds a cancellable context to DatabaseContext that’s closed on database Close. Sets this context on any blipContexts created by the database.
CBG-3834
Adds a cancellation context to database context, and passes this to blip context when receiving blipsync requests. On database close, closes cancellation context to trigger close of blip sync connections.
Have validated with android test app that CBL reconnects when database comes online, performing the following steps:
Integration Tests
GSI=true,xattrs=true
https://jenkins.sgwdev.com/job/SyncGateway-Integration/2750/