
ingest/ledgerbackend: Use context to handle termination and cleanup of captive core #3278

Merged
merged 21 commits into stellar:master from fix-captive-core-cleanup
Dec 17, 2020

Conversation

tamirms
Contributor

@tamirms tamirms commented Dec 10, 2020

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

This PR implements robust shutdown behavior in CaptiveStellarCore by embedding a context in stellarCoreRunner.

CaptiveStellarCore exposes the following functions:

  • PrepareRange()
  • IsPrepared()
  • GetLedger()
  • GetLatestLedgerSequence()
  • Close()

PrepareRange(), IsPrepared(), GetLedger(), and GetLatestLedgerSequence() cannot be called concurrently. Only one goroutine should call those functions on a CaptiveStellarCore instance. If any of those functions are accessed concurrently, the behavior of CaptiveStellarCore is undefined.

However, Close() can be called concurrently with any of the functions listed above. In other words, it is safe for Close() to interrupt any pending PrepareRange(), IsPrepared(), GetLedger(), or GetLatestLedgerSequence() operation. Once Close() is called, any pending operation (e.g. PrepareRange() / GetLedger()) is expected to terminate quickly. The Close() operation should also complete in a timely manner.
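
As a usage sketch (the constructor and exact signatures here are assumptions based on this description, not verbatim API):

package main

import (
	"log"

	"github.com/stellar/go/ingest/ledgerbackend"
)

func main() {
	// Hypothetical construction; the real fields live on ledgerbackend.CaptiveCoreConfig.
	backend, err := ledgerbackend.NewCaptive(ledgerbackend.CaptiveCoreConfig{})
	if err != nil {
		log.Fatal(err)
	}

	go func() {
		// Close() is safe to call from another goroutine at any time; it
		// interrupts any pending PrepareRange()/GetLedger() operation.
		backend.Close()
	}()

	// If Close() wins the race, PrepareRange() returns promptly with an
	// error instead of blocking until the range is prepared.
	if err := backend.PrepareRange(ledgerbackend.UnboundedRange(123)); err != nil {
		log.Println("PrepareRange interrupted:", err)
	}
}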

After Close() is called, the captive core subprocess should be reaped, the captive core named pipe should be closed, the temporary directory used by captive core should be deleted, and all goroutines related to buffering captive core logs or streaming ledgers from the named pipe should be terminated.

Note that you can still use a CaptiveStellarCore instance after calling Close(). If you call PrepareRange() after calling Close() then CaptiveStellarCore will create a new captive core subprocess (along with the named pipe, buffering go routines, etc) and begin a new session. [Edit: the behavior of Close() is now terminal. Once it is called the CaptiveStellarCore instance will fail on all subsequent operations]

The CaptiveCoreConfig struct has been extended to include an optional Context field. The context in CaptiveCoreConfig acts as a parent context for all PrepareRange() sessions. In other words, if the parent context is cancelled, any pending PrepareRange() / GetLedger() operation will be terminated even though Close() was not called. Furthermore, once the parent context is cancelled the CaptiveStellarCore instance will effectively be dead (e.g. it will not be possible to start any more PrepareRange() sessions).
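
A minimal sketch of the parent-context behavior; only the Context field is taken from this PR, everything else (constructor, remaining config fields) is illustrative:

package main

import (
	"context"
	"log"

	"github.com/stellar/go/ingest/ledgerbackend"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	backend, err := ledgerbackend.NewCaptive(ledgerbackend.CaptiveCoreConfig{
		Context: ctx, // optional parent context for all PrepareRange() sessions
		// ... binary path, network passphrase, history archive URLs, etc. ...
	})
	if err != nil {
		log.Fatal(err)
	}

	// Cancelling the parent context terminates any pending PrepareRange() /
	// GetLedger() call and permanently disables the instance, even though
	// Close() was never called.
	cancel()

	if err := backend.PrepareRange(ledgerbackend.UnboundedRange(123)); err != nil {
		log.Println("instance is dead:", err) // expected after cancel()
	}
}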

The behavior described above is implemented by embedding a context in the stellarCoreRunner struct. stellarCoreRunner will pass that context into exec.CommandContext() when spawning the captive core process. When we want to abort the running stellarCoreRunner instance we cancel the context associated with stellarCoreRunner, which will kill the captive core subprocess (by calling os.Process.Kill) if Stellar Core has not already exited on its own (see https://golang.org/pkg/os/exec/#CommandContext ). By using exec.CommandContext() we are able to remove some of the cleanup code from stellarCoreRunner, including the unix and windows implementations of processIsAlive().
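
The kill-on-cancel behavior comes straight from the standard library; a self-contained demonstration:

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// exec.CommandContext calls os.Process.Kill when ctx is cancelled,
	// if the process has not already exited on its own.
	cmd := exec.CommandContext(ctx, "sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Simulate stellarCoreRunner cancelling its embedded context.
	time.AfterFunc(time.Second, cancel)

	// Wait reaps the subprocess; after cancellation it returns an error
	// such as "signal: killed".
	fmt.Println(cmd.Wait())
}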

Previously, I mentioned that CaptiveStellarCore.Close() is able to safely interrupt any pending CaptiveStellarCore operation. This invariant is guaranteed because CaptiveStellarCore uses a read-write lock to protect access to stellarCoreRunner. While the read lock is held, CaptiveStellarCore cannot assign its stellarCoreRunner field to a new value. stellarCoreRunner.close() is also thread-safe and idempotent (so it doesn't matter if CaptiveStellarCore.Close() is called multiple times).

Once stellarCoreRunner.close() is invoked, it will cancel the stellarCoreRunner context and it will also close the channel used by CaptiveStellarCore to read ledgers from captive core. At this point any blocking calls to GetLedger() will be able to terminate quickly.
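
A simplified sketch of this locking and shutdown scheme (type and field names are stand-ins; the real implementation lives in CaptiveStellarCore and stellarCoreRunner):

package main

import (
	"context"
	"sync"
)

type metaResult struct{ err error }

// coreRunner stands in for stellarCoreRunner.
type coreRunner struct {
	cancel   context.CancelFunc
	metaPipe chan metaResult
	once     sync.Once
}

// close is thread-safe and idempotent: it cancels the runner's context
// (which kills the subprocess via exec.CommandContext) and closes the
// channel ledgers are read from, unblocking any pending GetLedger() call.
func (r *coreRunner) close() {
	r.once.Do(func() {
		r.cancel()
		close(r.metaPipe)
	})
}

// captiveCore stands in for CaptiveStellarCore.
type captiveCore struct {
	mu     sync.RWMutex
	runner *coreRunner
}

// Operations like GetLedger hold the read lock, so the runner field cannot
// be reassigned while they are in flight.
func (c *captiveCore) getLedger() {
	c.mu.RLock()
	defer c.mu.RUnlock()
	_ = c.runner // safe to use while the read lock is held
}

// Close also takes only the read lock (it never reassigns the field), so it
// can run concurrently with getLedger, and runner.close() makes repeated
// calls harmless.
func (c *captiveCore) Close() {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.runner != nil {
		c.runner.close()
	}
}

func main() {
	_, cancel := context.WithCancel(context.Background())
	c := &captiveCore{runner: &coreRunner{cancel: cancel, metaPipe: make(chan metaResult)}}
	go c.Close() // safe to interrupt from another goroutine
	c.getLedger()
}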

The code which determines how GetLedger() terminates on error is encapsulated in the checkMetaPipeResult() function. There are essentially three types of errors which can arise in GetLedger():

  1. User-initiated shutdown by cancelling the parent context or calling Close().
  2. The stellar core process exited unexpectedly.
  3. Some error was encountered while consuming the ledgers emitted by captive core (e.g. parsing invalid XDR).

All these cases are handled in checkMetaPipeResult().
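
A hedged sketch of that three-way classification (stub types throughout; the real checkMetaPipeResult may order its checks differently):

package main

import (
	"context"
	"errors"
	"fmt"
)

type metaResult struct{ err error }

// runner is a stub for stellarCoreRunner.
type runner struct {
	ctx     context.Context
	exited  bool  // set once cmd.Wait() has returned
	exitErr error // what cmd.Wait() returned
}

func checkMetaPipeResult(r *runner, result metaResult, ok bool) error {
	// 1. User-initiated shutdown: Close() or the parent context cancelled
	//    the runner's context. Report context.Canceled rather than the
	//    "signal: killed" error from the reaped subprocess.
	if err := r.ctx.Err(); err != nil {
		return err
	}
	// 2. The stellar core process exited unexpectedly.
	if r.exited {
		return fmt.Errorf("stellar core exited unexpectedly: %w", r.exitErr)
	}
	// 3. An error while consuming the ledgers (e.g. parsing invalid XDR),
	//    or the meta pipe channel closed without any explanation.
	if result.err != nil {
		return result.err
	}
	if !ok {
		return errors.New("meta pipe closed unexpectedly")
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	fmt.Println(checkMetaPipeResult(&runner{ctx: ctx}, metaResult{}, true)) // context canceled
}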

Some other points to note while reading the PR:

  • The implementation of bufferedLedgerMetaReader is now more decoupled from CaptiveStellarCore and stellarCoreRunner.
  • bufferedLedgerMetaReader is now created and owned by stellarCoreRunner. When stellarCoreRunner terminates, it will dispose of its bufferedLedgerMetaReader instance.
  • CaptiveStellarCore does not have to listen to any exit channels. CaptiveStellarCore only needs to handle the case when the stellarCoreRunner meta pipe is exhausted.
  • In the previous code the goroutine created by stellarCoreRunner to consume stdout and stderr from captive core was never terminated. This PR fixes that goroutine leak.
  • The named pipe is now properly closed by stellarCoreRunner.

Known limitations

I need to fix the tests. I plan to do so once I receive approval on the overall approach of the PR.

@cla-bot cla-bot bot added the cla: yes label Dec 10, 2020
@tamirms tamirms force-pushed the fix-captive-core-cleanup branch 7 times, most recently from 9f68992 to 19d5cd1 on December 10, 2020 12:46
@tamirms tamirms changed the title from ingest/ledgerbackend: Use contexts to handle termination and cleanup of captive core to ingest/ledgerbackend: Use context to handle termination and cleanup of captive core on Dec 10, 2020
Contributor

@bartekn bartekn left a comment


Thanks for this proposal. Here's a list of what I like and don't like. I also added some inline comments. Please note that I haven't run the code, so I may be wrong below.

Like:

  • bufferedMetaPipeReader used in stellarCoreRunner directly. I didn't do that in the PR that introduced bufferedMetaPipeReader because I thought we should keep it separate to give access to the raw meta stream, in case we use it in the future. But we can split it again when needed.

Don't like:

  • Looks like in order to properly close the backend you need to call Close() and cancel() the Context. This can be confusing because Context is not part of the LedgerBackend interface, so it will be hard to reuse the code with a different backend. It will be a problem even in the simplest apps that want to close the backend on CMD+C.
  • I think that two shutdown signals, in general, create all kinds of new concurrency issues. One example: if we cancel the context and call Close(), it's possible that the process exit error returned by r.cmd.Wait() will be killed instead of ErrCanceled because the Close() code executed first (I added an inline comment with an example of when this can happen). We can add a bunch of code to enforce the order we want, but tomb has a nice property of Kill: it doesn't overwrite the error on following calls to Kill, so we'll still know the original error (see the sketch after this list).
  • I added an inline comment to checkMetaPipeResult but wanted to discuss it here too. This is quite complicated because it has to check three different errors (stellarCoreRunner.context().Err(), result.err and getProcessExitError()) and the ok value. A developer not only needs to know the internals of stellarCoreRunner and bufferedMetaPipeReader but also the order in which events execute (context cancellation, channel close, etc.). This was really difficult for me to understand. In tomb you receive a single error (in our case nil, context.Canceled or an actual error) via a single method (so none of the potential concurrency issues I explained in the inline comment).
  • In general I don't like that we're back to using wait groups and closing channels, which can be easily misused and were misused in the past (Reliability issues in remote-core http server #3228, Crash in captive-core's backend #3159), and not only in our project. tomb can protect us from this pretty well.
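
For reference, a minimal demonstration of that Kill property (assuming gopkg.in/tomb.v2, which I believe is the package under discussion):

package main

import (
	"errors"
	"fmt"

	"gopkg.in/tomb.v2"
)

func main() {
	var t tomb.Tomb
	t.Go(func() error {
		<-t.Dying() // block until Kill is called
		return nil
	})

	t.Kill(errors.New("root cause"))         // first reason is recorded
	t.Kill(errors.New("shutdown requested")) // no-op: reason already set

	t.Wait()
	fmt.Println(t.Err()) // prints "root cause"
}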

One extra note: when I was searching for solutions to this problem I came across https://dave.cheney.net/2017/08/20/context-isnt-for-cancellation, which I think is worth checking. I don't fully agree with this statement because I think that context cancellation (by timeout/deadline/cancel) is a great way to handle HTTP request cancellation. But in cases when an ack is needed and the final error can be overwritten along the way, I think context is really hard to maintain.

return false, xdr.LedgerCloseMeta{}, errOut
}

func (c *CaptiveStellarCore) checkMetaPipeResult(result metaResult, ok bool) error {
Contributor

I think it's possible here that in case of context cancellation: c.stellarCoreRunner.context().Err() will be nil (it is only cancelled in close(), which may not have executed yet) but c.stellarCoreRunner.getProcessExitError() will report the process as exited without error (likely killed).

Contributor Author

Yes, that is a valid case. What's wrong with that scenario?

Contributor

You don't want GetLedger or PrepareRange to return errors in case of cancellation (except an error that indicates the operation was cancelled). A killed error can make users think something went wrong, and I guess we'd see it every once in a while when shutting down Horizon (we do handle context.Canceled though and don't log it at error level).

Contributor Author

If Close() is called on a CaptiveStellarCore which is running GetLedger() or PrepareRange(), the only possible error GetLedger() or PrepareRange() can return is context.Canceled. There is no scenario where a killed error will be returned.

Comment on lines 378 to 383
// drain meta pipe channel to make sure the ledger buffer goroutine exits
for range r.getMetaPipe() {
}
Contributor

We shouldn't do that. I actually removed this from the existing code in my PR because I realized that if GetLedger/PrepareRange is still running it may return ledgers out of order (because some are read here). That's why this is useful.

Contributor Author

I don't see how this is an issue. If we are closing the captive core instance we don't care about reading the ledgers and we want to terminate as soon as possible.

Contributor

Note that this is done in two different goroutines. So if GetLedger reads ledgers out of order it will error.

Contributor Author

Note that this is done in two different goroutines. So if GetLedger reads ledgers out of order it will error.

This will only occur if we're Closing() captive core, and in that scenario GetLedger() doesn't care about the ledgers and needs to exit as soon as possible.

@tamirms tamirms force-pushed the fix-captive-core-cleanup branch 2 times, most recently from 08fb50e to 9596859 on December 10, 2020 13:41
@tamirms
Contributor Author

tamirms commented Dec 10, 2020

@bartekn

Looks like in order to properly close the backend you need to call Close() and cancel() the Context. This can be confusing because Context is not part of the LedgerBackend interface, so it will be hard to reuse the code with a different backend. It will be a problem even in the simplest apps that want to close the backend on CMD+C.

This isn't quite right. If you call Close() on a captive core instance it will terminate any pending PrepareRange / GetLedger operations. That is guaranteed. You can still create a new session, but in order to do so you must call PrepareRange() after calling Close(). The benefit of the parent context is that if you cancel it, the captive core instance becomes completely disabled, making it impossible to start any new sessions. This solves the issue I mentioned in #3258 (comment).

I think that two shutdown signals, in general, create all kinds of new concurrency issues. One example: if we cancel the context and call Close(), it's possible that the process exit error returned by r.cmd.Wait() will be killed instead of ErrCanceled because the Close() code executed first (I added an inline comment with an example of when this can happen). We can add a bunch of code to enforce the order we want, but tomb has a nice property of Kill: it doesn't overwrite the error on following calls to Kill, so we'll still know the original error.

I think you misunderstood the semantics of processExitError. That variable is supposed to contain the process exit error as determined by cmd.Wait(). If you want to know if a captive core instance was terminated by Close() you can check the context error. In other words, context().Err() != nil if and only if Close() was called.

I added an inline comment to checkMetaPipeResult but wanted to discuss it here too. This is quite complicated because it has to check three different errors (stellarCoreRunner.context().Err(), result.err and getProcessExitError()) and the ok value. A developer not only needs to know the internals of stellarCoreRunner and bufferedMetaPipeReader but also the order in which events execute (context cancellation, channel close, etc.). This was really difficult for me to understand. In tomb you receive a single error (in our case nil, context.Canceled or an actual error) via a single method (so none of the potential concurrency issues I explained in the inline comment).

I think the comment about the complexity is a fair point. However, I think it's acceptable because this complexity is completely encapsulated within a single function, checkMetaPipeResult(). Also, stellarCoreRunner is not an exported type; it is coupled to CaptiveStellarCore. I view it as an internal type and part of the implementation details of CaptiveStellarCore. I can add some comments to explain the function. I don't agree with your comment about concurrency issues though.

In general I don't like that we're back to using wait groups and closing channels, which can be easily misused and were misused in the past (#3228, #3159), and not only in our project. tomb can protect us from this pretty well.

I believe the issues we had in the past can be attributed to the difficulty of working with concurrency in general more so than the specific use of wait groups or channels. In your PR where you introduced tomb we were still able to find a bunch of concurrency-related bugs.

@tamirms tamirms requested a review from a team December 10, 2020 16:22
@bartekn
Contributor

bartekn commented Dec 10, 2020

This isn't quite right. If you call Close() on a captive core instance it will terminate any pending PrepareRange / GetLedger operations. That is guaranteed. You can still create a new session, but in order to do so you must call PrepareRange() after calling Close(). The benefit of the parent context is that if you cancel it, the captive core instance becomes completely disabled, making it impossible to start any new sessions. This solves the issue I mentioned in #3258 (comment).

OK, I understand now. I was also thinking about it after your comment in #3258, and I think what we could do is make Close() actually close the ledger backend (any backend) so it's not usable anymore and you need to create a new one. I checked how we use it in Horizon and we don't really need to close it other than when closing the app, and I believe this will be the same for other apps. I think this will make the interface much clearer and an extra Context won't be required.

I think you misunderstood the semantics of processExitError. That variable is supposed to contain the process exit error as determined by cmd.Wait(). If you want to know if a captive core instance was terminated by Close() you can check the context error. In other words, context().Err() != nil if and only if Close() was called.

OK, I can see it now after you explained what CaptiveStellarCore.Context is used for.

I don't agree with your comment about concurrency issues though.

Again, it's clear after the CaptiveStellarCore.Context explanations. But it's more complicated compared to tomb, in my opinion.

I believe the issues we had in the past can be attributed to the difficulty of working with concurrency in general more so than the specific use of wait groups or channels. In your PR where you introduced tomb we were still able to find a bunch of concurrency-related bugs.

Agreed! I meant: in my opinion it's much easier to make mistakes like closing a channel twice when operating on channels directly compared to when using tomb, because its API protects from something like this. I'm not saying it's the case here, but it may happen when we refactor the code in the future.

@tamirms
Contributor Author

tamirms commented Dec 11, 2020

@bartekn I have updated the PR to make the context in CaptiveCoreConfig optional. Also, I have changed the behavior of CaptiveStellarCore.Close() to make it terminal. Once CaptiveStellarCore.Close() is invoked, the CaptiveStellarCore instance will no longer be functional.

@tamirms tamirms force-pushed the fix-captive-core-cleanup branch 2 times, most recently from 4591ffd to 45c199c on December 11, 2020 09:35
@tamirms tamirms marked this pull request as ready for review December 16, 2020 18:54
// Context is the (optional) context which controls the lifetime of a CaptiveStellarCore instance. Once the context is done
// the CaptiveStellarCore instance will not be able to stream ledgers from Stellar Core or spawn new
// instances of Stellar Core. If Context is omitted CaptiveStellarCore will default to using context.Background.
Context context.Context
Contributor

same here

Contributor Author

We could consider moving the context so that it's a parameter to PrepareRange() instead, but I'd prefer implementing that change in a follow-up PR if necessary.

Contributor

Sure, please do.

if err != nil {
	return errors.Wrap(err, "error closing existing session")
}

Contributor

NIIIIIICE!

if len(c.ledgerBuffer.getChannel()) > 0 {
	break
}
time.Sleep(c.waitIntervalPrepareRange)
Contributor

Much cleaner without this. Great!

@2opremio
Contributor

I think this is a great step forward. The code is way cleaner and (IMO) easier to understand.

I am wary of bugs being introduced due to the changes being considerable ... but that's what a beta is for.

I think we should try to do some release testing specifically for this PR.

@bartekn
Contributor

bartekn commented Dec 17, 2020

I think we should try to do some release testing specifically for this PR.

Agree, the plan is to test the last commit when all the remaining PRs are merged (and then master is merged into the release branch). Hopefully this will happen today.

@tamirms tamirms merged commit 06a1b84 into stellar:master Dec 17, 2020
@tamirms tamirms deleted the fix-captive-core-cleanup branch December 17, 2020 17:19
bartekn pushed a commit to bartekn/go that referenced this pull request Jan 6, 2021
…f captive core (stellar#3278)

bartekn added a commit that referenced this pull request Oct 21, 2021
…evious instance termination (#4020)

Add code to `CaptiveCoreBackend.startPreparingRange` that ensures that the
previously started Stellar-Core instance is not running: check whether
`getProcessExitError` returns `true`, which means the Stellar-Core process has
fully terminated. This prevents a situation in which a new instance is started
and clashes with the previous one.
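
A sketch of the guard being described (the `(bool, error)` shape of `getProcessExitError` is an assumption based on this commit message; stub types throughout):

package main

import "fmt"

// coreRunner is a stub for stellarCoreRunner.
type coreRunner struct {
	exited  bool  // only true once cmd.Wait() has returned
	exitErr error // what cmd.Wait() returned
}

func (r *coreRunner) getProcessExitError() (bool, error) { return r.exited, r.exitErr }

type captiveBackend struct{ runner *coreRunner }

// startPreparingRange-style guard: do not spawn a new Stellar-Core instance
// until the previous one is fully terminated, not merely asked to shut down.
func (c *captiveBackend) canStartNewInstance() bool {
	if c.runner == nil {
		return true
	}
	exited, _ := c.runner.getProcessExitError()
	return exited
}

func main() {
	b := &captiveBackend{runner: &coreRunner{exited: false}}
	fmt.Println(b.canStartNewInstance()) // false: old process still shutting down
	b.runner.exited = true
	fmt.Println(b.canStartNewInstance()) // true: safe to start a new instance
}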

The existing code contains a bug, likely introduced in 0f2d08b. The context
returned by `stellarCoreRunner.context()` is cancelled in
`stellarCoreRunner.close()`, which initiates the termination process. At the
same time, `CaptiveCoreBackend.PrepareRange()` internally calls
`CaptiveCoreBackend.isClosed()`, whose return value depends on the
`stellarCoreRunner` context being cancelled. This is wrong because it's
possible that Stellar-Core is still not closed even when the aforementioned
context is cancelled: it can still be shutting down, so the process can still
be running.

Because of this the following chain of events can lead to two Stellar-Core
instances running (briefly) at the same time:

1. A Stellar-Core instance is upgraded, triggering `fileWatcher` to call
   `stellarCoreRunner.close()`, which cancels the `stellarCoreRunner.context()`.
2. In another goroutine, `CaptiveBackend.IsPrepared()` is called, which returns
   `false` because `stellarCoreRunner.context()` is cancelled, and then calls
   `CaptiveBackend.PrepareRange()` to restart Stellar-Core. `PrepareRange()`
   also checks if `stellarCoreRunner.context()` is cancelled (it is, but the
   Stellar-Core process can still be running its shutdown procedure) and then
   attempts to start a new instance.

This commit is really a quick fix. The code before 0f2d08b was simpler because
it was calling `Kill()` on a process, so "terminating" and "terminated" were
exactly the same state. After 0f2d08b there are now two events associated with
a Stellar-Core process (as above). Because of this the code requires a larger
refactoring. We may reconsider using the `tomb` package I tried in #3258,
which was later closed in favour of #3278.