Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CBG-3834 Close connected websocket connections on database close #7157

Merged
merged 5 commits into from
Oct 21, 2024

Conversation

adamcfraser
Copy link
Collaborator

@adamcfraser adamcfraser commented Oct 14, 2024

CBG-3834

Adds a cancellation context to database context, and passes this to blip context when receiving blipsync requests. On database close, closes cancellation context to trigger close of blip sync connections.

Have validated with android test app that CBL reconnects when database comes online, performing the following steps:

  1. Start client replication, verify replication traffic
  2. Update the database to modify the sync function (triggers a close of the database). Observed client replication closing when database was stopped, and reconnecting when database was restarted.
  3. Verified replication traffic (ensures blip connection wasn't using stale connection to closed bucket)

Integration Tests

@@ -208,7 +208,7 @@ func connect(arc *activeReplicatorCommon, idSuffix string) (blipSender *blip.Sen
arc.replicationStats.NumConnectAttempts.Add(1)

var originPatterns []string // no origin headers for ISGR
blipContext, err := NewSGBlipContext(arc.ctx, arc.config.ID+idSuffix, originPatterns)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd actually want to close connections for ISGR when the database is closed as well?

If we don't want to do this, I think this needs a comment why we don't want to do this, or maybe even passing a context calling https://pkg.go.dev/context#WithoutCancel to make it clear that we don't want it to be cancelled.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, we are not going to pass in a cancellation context for a passive ISGR peer at this time to limit the scope of this PR.

I think it would be safe to add this here, but it's not needed since DatabaseContext.Close will call DatabaseContext.sgReplicateMgr.Stop which stops all replicators. If you agree with this, I think adding a comment to this code would be helpful, because I'll ask this question again.

@@ -412,6 +414,10 @@ func NewDatabaseContext(ctx context.Context, dbName string, bucket base.Bucket,
UserFunctionTimeout: defaultUserFunctionTimeout,
}

// set up cancellable context based on the background context (context lifecycle should decoupled from the request context)
//
dbContext.CancelContext, dbContext.cancelContextFunc = context.WithCancel(context.Background())
Copy link
Collaborator

@torcolvin torcolvin Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a reasonable argument here that this should work the same as DatabaseContext.exitChanges, maybe even in the same function.

There are two pathways here:

  • DatabaseContext.Close (this is what is described here), which shuts down the database
  • DatabaseContext.TakeDbOffline (this is not really used much to take a local node offline).

In this case, we'd want to cancel the context while the database is offline, and restart it next time we call StartOnlineProcesses.

Suggestions:

  1. Make NewSGBlipContext fail if a non nil context is passed in, so the caller has to pass this in with Sync Gateway, and it would error if it were cancelled and we'd be likely to catch this in our tests.
  2. Possibly write a test for local offline.
  3. Put exitChanges and CancelContext into a common function and explain what the differences are. exitChanges is legacy code that might no longer be necessary to close a changes feed. I don't think you have to fix it now, but I am going to ask about this when I look again. exitChanges doesn't work for all cases because we want the blip connections to close if the connection is open but idle.
  4. Use the ctx that is passed in from NewDatabaseContext or StartOnlineProcesses. We are careful to detach this from the request context already, or the background processes would fail:
    if err := h.server.ReloadDatabaseWithConfig(contextNoCancel, *updatedDbConfig); err != nil {
    /
    err = h.server.ReloadDatabaseWithConfig(contextNoCancel, tmpConfig)
    We can just call WithoutCancel on the context passed in (so it retains any information from where it is passed in, in case we want it later) but then set up a cancellation function.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't understand this code correctly before and I think we can limit the scope of this PR.

There are two pathways the code cares about:

  1. Exiting the changes feed from a blip handler. This is done by exitChanges (for takeDBOffline), or the request context cancelling ChangesOptions.ChangesCtx, or the database calling changeListener.Stop which will close changeListener.terminator and then the changes feeds are woken up with changeListener.tapNotifier.Broadcast which will return WaiterCheckTerminated. This has everything to do with closing a goroutine on the server side and nothing to do with closing a connection from a websocket client. The reason that I thought this would close the connection is that this code eventually exits here
    forceClose := generateBlipSyncChanges(bh.loggingCtx, changesDb, channelSet, options, opts.docIDs, func(changes []*ChangeEntry) error {
    , but this doesn't shut down the websocket connection.
  2. This PR is intended to cover closing the connection on potentially idle clients, and I was definitely confusing the two.

I don't know how to make this more clear, so I'm leaving a lot of text here. I think this isn't important to do now, but DatabaseContext.exitChanges would be obsolesced by DatabaseContext.CancelContext if we decided to handle the offline case. I like the idea of leaving behind some type of in code bread crumbs, because I don't think we'll prioritize cleaning this code up. I have spent considerable amount of time looking at this code for a past hanging goroutine in 7b5a401 so I could be suffering from recency bias.

I think a comment as to why you are picking at the start of NewDatabaseContext rather than in StartOnlineProcesses would be something that seemed not like the typical pattern.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment for clarity.

// set up cancellable context based on the background context (context lifecycle should decoupled from the request context)
//
dbContext.CancelContext, dbContext.cancelContextFunc = context.WithCancel(context.Background())

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you decide to keep cancellation context in NewDatabaseContext, you'll need:

cleanupFunctions = append(cleanupFunctions, func() {
		dbContext.cancelContextFunc()
})

rest/blip_api_crud_test.go Outdated Show resolved Hide resolved
// TestBlipDatabaseClose verifies that the client connection is closed when the database is closed.
// Starts a continuous pull replication then updates the db to trigger a close.
func TestBlipDatabaseClose(t *testing.T) {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a pretty good argument for setting this up with btcRunner := NewBlipTesterClientRunner(t) which will parameterize using 3.x and 4.x messages. Using btcRunner.NewBlipTesterClientOptsWithRT(rt, opts) would allow you to not have to worry about implementing handlers, and you can do btcRunner.StartPullSince.

I think this would be equivalent to the test below but with more common boilerplate.

I think there's a different argument to be made that /db/_blipsync should be called, since requests contexts are at place here, and testing for the actual error status to see that we don't regress. I don't think we yet have good boilerplate for this because while ISGR lets you use a real URL

dbURL, err := url.Parse(server.URL + "/db/_blipsync")
we don't have an easy way to send blip requests.

Copy link
Collaborator

@torcolvin torcolvin Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to not set up real _blipsync here, we can be sure it exits with some error code. Here's some more up to date boilerplate for this test:

unc TestBlipDatabaseClose(t *testing.T) {

	base.SetUpTestLogging(t, base.LevelInfo, base.KeyHTTP, base.KeySync, base.KeySyncMsg, base.KeyChanges, base.KeyCache)
	btcRunner := NewBlipTesterClientRunner(t)
	btcRunner.Run(func(t *testing.T, SupportedBLIPProtocols []string) {
		rt := NewRestTesterPersistentConfig(t)
		defer rt.Close()
		const username = "alice"
		rt.CreateUser(username, []string{"*"})
		btc := btcRunner.NewBlipTesterClientOptsWithRT(rt, &BlipTesterClientOpts{Username: username})
		var blipContextClosed atomic.Bool
		btcRunner.clients[btc.id].pullReplication.bt.blipContext.OnExitCallback = func() {
			blipContextClosed.Store(true)
		}

		// put a doc, and make sure blip connection is established
		markerDoc := "markerDoc"
		markerDocVersion := rt.CreateTestDoc(markerDoc)
		rt.WaitForPendingChanges()
		btcRunner.StartPull(btc.id)

		btcRunner.WaitForVersion(btc.id, markerDoc, markerDocVersion)

		RequireStatus(t, rt.SendAdminRequest(http.MethodDelete, "/{{.db}}/", ""), http.StatusOK)

		require.EventuallyWithT(t, func(c *assert.CollectT) {
			assert.True(c, blipContextClosed.Load())
		}, time.Second*10, time.Millisecond*100)
	})
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test-race failure is from blipContextClosed not being an atomic.

@@ -1520,7 +1520,7 @@ func createBlipTesterWithSpec(tb testing.TB, spec BlipTesterSpec, rt *RestTester
return nil, err
}
// Make BLIP/Websocket connection
bt.blipContext, err = db.NewSGBlipContextWithProtocols(base.TestCtx(tb), "", origin, protocols)
bt.blipContext, err = db.NewSGBlipContextWithProtocols(base.TestCtx(tb), "", origin, protocols, nil)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should test this in our test code as well.

Suggested change
bt.blipContext, err = db.NewSGBlipContextWithProtocols(base.TestCtx(tb), "", origin, protocols, nil)
bt.blipContext, err = db.NewSGBlipContextWithProtocols(base.TestCtx(tb), "", origin, protocols, bt.rt.Database().CancelContext)

Comment on lines +49 to +52
github.com/couchbase/go-blip v0.0.0-20241013214017-67df25c2fb5d h1:bnYkmUky/Hy7m2NaIhkEu2MKpJAULcYm90oku3D+3E4=
github.com/couchbase/go-blip v0.0.0-20241013214017-67df25c2fb5d/go.mod h1:Dz8Keu17/4cjF7hvKYqOjH4pRXOh1CCnzsKlBOJaoJE=
github.com/couchbase/go-blip v0.0.0-20241014142134-cc8d8ebf1949 h1:jwFj/GtyaoACmwnGfan/XW38TBTG1kYboXLZfAqd2VE=
github.com/couchbase/go-blip v0.0.0-20241014142134-cc8d8ebf1949/go.mod h1:Dz8Keu17/4cjF7hvKYqOjH4pRXOh1CCnzsKlBOJaoJE=
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go mod tidy will fix this up

@torcolvin torcolvin merged commit 6317f19 into main Oct 21, 2024
41 checks passed
@torcolvin torcolvin deleted the CBG-3834 branch October 21, 2024 12:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants