
ingest/ledgerbackend: Use context to handle termination and cleanup of captive core #3278

Merged
merged 21 commits into stellar:master from fix-captive-core-cleanup
Dec 17, 2020

Conversation

tamirms
Contributor

@tamirms tamirms commented Dec 10, 2020

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

This PR implements robust shutdown behavior in CaptiveStellarCore by embedding a context in stellarCoreRunner.

CaptiveStellarCore exposes the following functions:

  • PrepareRange()
  • IsPrepared()
  • GetLedger()
  • GetLatestLedgerSequence()
  • Close()

PrepareRange(), IsPrepared(), GetLedger(), and GetLatestLedgerSequence() cannot be called concurrently. Only one goroutine should call those functions on a CaptiveStellarCore instance. If any of those functions are accessed concurrently, the behavior of CaptiveStellarCore is undefined.

However, Close() can be called concurrently with any of the functions listed above. In other words, it is safe for Close() to interrupt any pending PrepareRange(), IsPrepared(), GetLedger(), or GetLatestLedgerSequence() operation. Once Close() is called, any pending operation (e.g. PrepareRange() / GetLedger()) is expected to terminate quickly. The Close() operation should also complete in a timely manner.
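
As a usage sketch (the constructor and exact signatures here are assumptions based on this description, not verbatim API):

package main

import (
	"log"

	"github.com/stellar/go/ingest/ledgerbackend"
)

func main() {
	// Hypothetical construction; the real fields live on ledgerbackend.CaptiveCoreConfig.
	backend, err := ledgerbackend.NewCaptive(ledgerbackend.CaptiveCoreConfig{})
	if err != nil {
		log.Fatal(err)
	}

	go func() {
		// Close() is safe to call from another goroutine at any time; it
		// interrupts any pending PrepareRange()/GetLedger() operation.
		backend.Close()
	}()

	// If Close() wins the race, PrepareRange() returns promptly with an
	// error instead of blocking until the range is prepared.
	if err := backend.PrepareRange(ledgerbackend.UnboundedRange(123)); err != nil {
		log.Println("PrepareRange interrupted:", err)
	}
}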

After Close() is called, the captive core subprocess should be reaped, the captive core named pipe should be closed, the temporary directory used by captive core should be deleted, and all goroutines related to buffering captive core logs or streaming ledgers from the named pipe should be terminated.

Note that you can still use a CaptiveStellarCore instance after calling Close(). If you call PrepareRange() after calling Close() then CaptiveStellarCore will create a new captive core subprocess (along with the named pipe, buffering go routines, etc) and begin a new session. [Edit: the behavior of Close() is now terminal. Once it is called the CaptiveStellarCore instance will fail on all subsequent operations]

The CaptiveCoreConfig struct has been extended to include an optional Context field. The context in CaptiveCoreConfig acts as a parent context for all PrepareRange() sessions. In other words, if the parent context is cancelled, any pending PrepareRange() / GetLedger() operation will be terminated even though Close() was not called. Furthermore, once the parent context is cancelled the CaptiveStellarCore instance will effectively be dead (e.g. it will not be possible to start any more PrepareRange() sessions).
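
A minimal sketch of the parent-context behavior; only the Context field is taken from this PR, everything else (constructor, remaining config fields) is illustrative:

package main

import (
	"context"
	"log"

	"github.com/stellar/go/ingest/ledgerbackend"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	backend, err := ledgerbackend.NewCaptive(ledgerbackend.CaptiveCoreConfig{
		Context: ctx, // optional parent context for all PrepareRange() sessions
		// ... binary path, network passphrase, history archive URLs, etc. ...
	})
	if err != nil {
		log.Fatal(err)
	}

	// Cancelling the parent context terminates any pending PrepareRange() /
	// GetLedger() call and permanently disables the instance, even though
	// Close() was never called.
	cancel()

	if err := backend.PrepareRange(ledgerbackend.UnboundedRange(123)); err != nil {
		log.Println("instance is dead:", err) // expected after cancel()
	}
}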

The behavior described above is implemented by embedding a context in the stellarCoreRunner struct. stellarCoreRunner will pass that context into exec.CommandContext() when spawning the captive core process. When we want to abort the running stellarCoreRunner instance we cancel the context associated with stellarCoreRunner, which will kill the captive core subprocess (by calling os.Process.Kill) if Stellar Core has not already exited on its own (see https://golang.org/pkg/os/exec/#CommandContext ). By using exec.CommandContext() we are able to remove some of the cleanup code from stellarCoreRunner, including the unix and windows implementations of processIsAlive().
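
The kill-on-cancel behavior comes straight from the standard library; a self-contained demonstration:

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// exec.CommandContext calls os.Process.Kill when ctx is cancelled,
	// if the process has not already exited on its own.
	cmd := exec.CommandContext(ctx, "sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Simulate stellarCoreRunner cancelling its embedded context.
	time.AfterFunc(time.Second, cancel)

	// Wait reaps the subprocess; after cancellation it returns an error
	// such as "signal: killed".
	fmt.Println(cmd.Wait())
}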

Previously, I mentioned that CaptiveStellarCore.Close() is able to safely interrupt any pending CaptiveStellarCore operation. This invariant is guaranteed because CaptiveStellarCore uses a read-write lock to protect access to stellarCoreRunner. While the read lock is held, CaptiveStellarCore cannot assign its stellarCoreRunner field to a new value. stellarCoreRunner.close() is also thread-safe and idempotent (so it doesn't matter if CaptiveStellarCore.Close() is called multiple times).

Once stellarCoreRunner.close() is invoked, it will cancel the stellarCoreRunner context and it will also close the channel used by CaptiveStellarCore to read ledgers from captive core. At this point any blocking calls to GetLedger() will be able to terminate quickly.
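
A simplified sketch of this locking and shutdown scheme (type and field names are stand-ins; the real implementation lives in CaptiveStellarCore and stellarCoreRunner):

package main

import (
	"context"
	"sync"
)

type metaResult struct{ err error }

// coreRunner stands in for stellarCoreRunner.
type coreRunner struct {
	cancel   context.CancelFunc
	metaPipe chan metaResult
	once     sync.Once
}

// close is thread-safe and idempotent: it cancels the runner's context
// (which kills the subprocess via exec.CommandContext) and closes the
// channel ledgers are read from, unblocking any pending GetLedger() call.
func (r *coreRunner) close() {
	r.once.Do(func() {
		r.cancel()
		close(r.metaPipe)
	})
}

// captiveCore stands in for CaptiveStellarCore.
type captiveCore struct {
	mu     sync.RWMutex
	runner *coreRunner
}

// Operations like GetLedger hold the read lock, so the runner field cannot
// be reassigned while they are in flight.
func (c *captiveCore) getLedger() {
	c.mu.RLock()
	defer c.mu.RUnlock()
	_ = c.runner // safe to use while the read lock is held
}

// Close also takes only the read lock (it never reassigns the field), so it
// can run concurrently with getLedger, and runner.close() makes repeated
// calls harmless.
func (c *captiveCore) Close() {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.runner != nil {
		c.runner.close()
	}
}

func main() {
	_, cancel := context.WithCancel(context.Background())
	c := &captiveCore{runner: &coreRunner{cancel: cancel, metaPipe: make(chan metaResult)}}
	go c.Close() // safe to interrupt from another goroutine
	c.getLedger()
}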

The code which determines how GetLedger() terminates on error is encapsulated in the checkMetaPipeResult() function. There are essentially three types of errors which can arise in GetLedger():

  1. User-initiated shutdown by cancelling the parent context or calling Close().
  2. The stellar core process exited unexpectedly.
  3. Some error was encountered while consuming the ledgers emitted by captive core (e.g. parsing invalid XDR).

All these cases are handled in checkMetaPipeResult().
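
A hedged sketch of that three-way classification (stub types throughout; the real checkMetaPipeResult may order its checks differently):

package main

import (
	"context"
	"errors"
	"fmt"
)

type metaResult struct{ err error }

// runner is a stub for stellarCoreRunner.
type runner struct {
	ctx     context.Context
	exited  bool  // set once cmd.Wait() has returned
	exitErr error // what cmd.Wait() returned
}

func checkMetaPipeResult(r *runner, result metaResult, ok bool) error {
	// 1. User-initiated shutdown: Close() or the parent context cancelled
	//    the runner's context. Report context.Canceled rather than the
	//    "signal: killed" error from the reaped subprocess.
	if err := r.ctx.Err(); err != nil {
		return err
	}
	// 2. The stellar core process exited unexpectedly.
	if r.exited {
		return fmt.Errorf("stellar core exited unexpectedly: %w", r.exitErr)
	}
	// 3. An error while consuming the ledgers (e.g. parsing invalid XDR),
	//    or the meta pipe channel closed without any explanation.
	if result.err != nil {
		return result.err
	}
	if !ok {
		return errors.New("meta pipe closed unexpectedly")
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	fmt.Println(checkMetaPipeResult(&runner{ctx: ctx}, metaResult{}, true)) // context canceled
}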

Some other points to note while reading the PR:

  • The implementation of bufferedLedgerMetaReader is now more decoupled from CaptiveStellarCore and stellarCoreRunner.
  • bufferedLedgerMetaReader is now created and owned by stellarCoreRunner. When stellarCoreRunner terminates, it will dispose of its bufferedLedgerMetaReader instance.
  • CaptiveStellarCore does not have to listen to any exit channels. CaptiveStellarCore only needs to handle the case when the stellarCoreRunner meta pipe is exhausted.
  • In the previous code the goroutine created by stellarCoreRunner to consume stdout and stderr from captive core was never terminated. This PR fixes that goroutine leak.
  • The named pipe is now properly closed by stellarCoreRunner.

Known limitations

I need to fix the tests. I plan to do so once I receive approval on the overall approach of the PR.

@cla-bot cla-bot bot added the cla: yes label Dec 10, 2020
@tamirms tamirms force-pushed the fix-captive-core-cleanup branch 7 times, most recently from 9f68992 to 19d5cd1 on December 10, 2020 12:46
@tamirms tamirms changed the title from ingest/ledgerbackend: Use contexts to handle termination and cleanup of captive core to ingest/ledgerbackend: Use context to handle termination and cleanup of captive core on Dec 10, 2020
Contributor

@bartekn bartekn left a comment


Thanks for this proposal. Here's a list of what I like and don't like. I also added some inline comments. Please note that I haven't run the code, so I may be wrong below.

Like:

  • bufferedMetaPipeReader used in stellarCoreRunner directly. I didn't do that in the PR that introduced bufferedMetaPipeReader because I thought we should keep it separate to give access to the raw meta stream, in case we use it in the future. But we can split it again when needed.

Don't like:

  • Looks like in order to properly close the backend you need to call Close() and cancel() the Context. This can be confusing because Context is not part of the LedgerBackend interface, so it will be hard to reuse the code with a different backend. It will be a problem even in the simplest apps that want to close the backend on CMD+C.
  • I think that two shutdown signals, in general, create all kinds of new concurrency issues. One example: if we cancel the context and call Close(), it's possible that the process exit error returned by r.cmd.Wait() will be killed instead of ErrCanceled because the Close() code executed first (I added an inline comment with an example of when this can happen). We can add a bunch of code to enforce the order we want, but tomb has a nice property of Kill: it doesn't overwrite the error on following calls to Kill, so we'll still know the original error (see the sketch after this list).
  • I added an inline comment to checkMetaPipeResult but wanted to discuss it here too. This is quite complicated because it has to check three different errors (stellarCoreRunner.context().Err(), result.err and getProcessExitError()) and the ok value. A developer not only needs to know the internals of stellarCoreRunner and bufferedMetaPipeReader but also the order in which events execute (context cancellation, channel close, etc.). This was really difficult for me to understand. In tomb you receive a single error (in our case nil, context.Canceled or an actual error) via a single method (so none of the potential concurrency issues I explained in the inline comment).
  • In general I don't like that we're back to using wait groups and closing channels, which can be easily misused and were misused in the past (Reliability issues in remote-core http server #3228, Crash in captive-core's backend #3159), and not only in our project. tomb can protect us from this pretty well.
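
For reference, a minimal demonstration of that Kill property (assuming gopkg.in/tomb.v2, which I believe is the package under discussion):

package main

import (
	"errors"
	"fmt"

	"gopkg.in/tomb.v2"
)

func main() {
	var t tomb.Tomb
	t.Go(func() error {
		<-t.Dying() // block until Kill is called
		return nil
	})

	t.Kill(errors.New("root cause"))         // first reason is recorded
	t.Kill(errors.New("shutdown requested")) // no-op: reason already set

	t.Wait()
	fmt.Println(t.Err()) // prints "root cause"
}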

One extra note: when I was searching for solutions to this problem I came across https://dave.cheney.net/2017/08/20/context-isnt-for-cancellation, which I think is worth checking. I don't fully agree with this statement because I think that context cancellation (by timeout/deadline/cancel) is a great way to handle HTTP request cancellation. But in cases when an ack is needed and the final error can be overwritten along the way, I think context is really hard to maintain.

return false, xdr.LedgerCloseMeta{}, errOut
}

func (c *CaptiveStellarCore) checkMetaPipeResult(result metaResult, ok bool) error {
Contributor

I think it's possible here that in case of context cancellation: c.stellarCoreRunner.context().Err() will be nil (it is only cancelled in close(), which may not have executed yet) but c.stellarCoreRunner.getProcessExitError() will report the process as exited without error (likely killed).

Contributor Author

Yes, that is a valid case. What's wrong with that scenario?

Contributor

You don't want GetLedger or PrepareRange to return errors in case of cancellation (except an error that indicates the operation was cancelled). A killed error can make users think something went wrong, and I guess we'd see it every once in a while when shutting down Horizon (we do handle context.Canceled though and don't log it at error level).

Contributor Author

If Close() is called on a CaptiveStellarCore which is running GetLedger() or PrepareRange(), the only possible error GetLedger() or PrepareRange() can return is context.Canceled. There is no scenario where a killed error will be returned.

Comment on lines 378 to 383
// drain meta pipe channel to make sure the ledger buffer goroutine exits
for range r.getMetaPipe() {
}
Contributor

We shouldn't do that. I actually removed this from the existing code in my PR because I realized that if GetLedger/PrepareRange is still running it may return ledgers out of order (because some are read here). That's why this is useful.

Contributor Author

I don't see how this is an issue. If we are closing the captive core instance we don't care about reading the ledgers and we want to terminate as soon as possible.

Contributor

Note that this is done in two different goroutines. So if GetLedger reads ledgers out of order it will error.

Contributor Author

Note that this is done in two different goroutines. So if GetLedger reads ledgers out of order it will error.

This will only occur if we're Closing() captive core, and in that scenario GetLedger() doesn't care about the ledgers and needs to exit as soon as possible.

@tamirms tamirms force-pushed the fix-captive-core-cleanup branch 2 times, most recently from 08fb50e to 9596859 on December 10, 2020 13:41
@tamirms
Contributor Author

tamirms commented Dec 10, 2020

@bartekn

Looks like in order to properly close the backend you need to call Close() and cancel() the Context. This can be confusing because Context is not part of the LedgerBackend interface, so it will be hard to reuse the code with a different backend. It will be a problem even in the simplest apps that want to close the backend on CMD+C.

This isn't quite right. If you call Close() on a captive core instance it will terminate any pending PrepareRange / GetLedger operations. That is guaranteed. You can still create a new session, but in order to do so you must call PrepareRange() after calling Close(). The benefit of the parent context is that if you cancel it, the captive core instance becomes completely disabled, making it impossible to start any new sessions. This solves the issue I mentioned in #3258 (comment).

I think that two shutdown signals, in general, create all kinds of new concurrency issues. One example: if we cancel the context and call Close(), it's possible that the process exit error returned by r.cmd.Wait() will be killed instead of ErrCanceled because the Close() code executed first (I added an inline comment with an example of when this can happen). We can add a bunch of code to enforce the order we want, but tomb has a nice property of Kill: it doesn't overwrite the error on following calls to Kill, so we'll still know the original error.

I think you misunderstood the semantics of processExitError. That variable is supposed to contain the process exit error as determined by cmd.Wait(). If you want to know if a captive core instance was terminated by Close() you can check the context error. In other words, context().Err() != nil if and only if Close() was called.

I added an inline comment to checkMetaPipeResult but wanted to discuss it here too. This is quite complicated because it has to check three different errors (stellarCoreRunner.context().Err(), result.err and getProcessExitError()) and the ok value. A developer not only needs to know the internals of stellarCoreRunner and bufferedMetaPipeReader but also the order in which events execute (context cancellation, channel close, etc.). This was really difficult for me to understand. In tomb you receive a single error (in our case nil, context.Canceled or an actual error) via a single method (so none of the potential concurrency issues I explained in the inline comment).

I think the comment about the complexity is a fair point. However, I think it's acceptable because this complexity is completely encapsulated within a single function, checkMetaPipeResult(). Also, stellarCoreRunner is not an exported type; it is coupled to CaptiveStellarCore. I view it as an internal type and part of the implementation details of CaptiveStellarCore. I can add some comments to explain the function. I don't agree with your comment about concurrency issues though.

In general I don't like that we're back to using wait groups and closing channels, which can be easily misused and were misused in the past (#3228, #3159), and not only in our project. tomb can protect us from this pretty well.

I believe the issues we had in the past can be attributed to the difficulty of working with concurrency in general more so than the specific use of wait groups or channels. In your PR where you introduced tomb we were still able to find a bunch of concurrency-related bugs.

@tamirms tamirms requested a review from a team December 10, 2020 16:22
@bartekn
Contributor

bartekn commented Dec 10, 2020

This isn't quite right. If you call Close() on a captive core instance it will terminate any pending PrepareRange / GetLedger operations. That is guaranteed. You can still create a new session, but in order to do so you must call PrepareRange() after calling Close(). The benefit of the parent context is that if you cancel it, the captive core instance becomes completely disabled, making it impossible to start any new sessions. This solves the issue I mentioned in #3258 (comment).

OK, I understand now. I was also thinking about it after your comment in #3258, and I think what we could do is make Close() actually close the ledger backend (any backend) so it's not usable anymore and you need to create a new one. I checked how we use it in Horizon and we don't really need to close it other than when closing the app, and I believe this will be the same for other apps. I think this will make the interface much clearer and an extra Context won't be required.

I think you misunderstood the semantics of processExitError. That variable is supposed to contain the process exit error as determined by cmd.Wait(). If you want to know if a captive core instance was terminated by Close() you can check the context error. In other words, context().Err() != nil if and only if Close() was called.

OK, I can see it now after you explained what CaptiveStellarCore.Context is used for.

I don't agree with your comment about concurrency issues though.

Again, it's clear after the CaptiveStellarCore.Context explanations. But it's more complicated compared to tomb, in my opinion.

I believe the issues we had in the past can be attributed to the difficulty of working with concurrency in general more so than the specific use of wait groups or channels. In your PR where you introduced tomb we were still able to find a bunch of concurrency-related bugs.

Agreed! I meant: in my opinion it's much easier to make mistakes like closing a channel twice when operating on channels directly compared to when using tomb, because its API protects from something like this. I'm not saying it's the case here, but it may happen when we refactor the code in the future.

@tamirms
Contributor Author

tamirms commented Dec 11, 2020

@bartekn I have updated the PR to make the context in CaptiveCoreConfig optional. Also, I have changed the behavior of CaptiveStellarCore.Close() to make it terminal. Once CaptiveStellarCore.Close() is invoked, the CaptiveStellarCore instance will no longer be functional.

@tamirms tamirms force-pushed the fix-captive-core-cleanup branch 2 times, most recently from 4591ffd to 45c199c on December 11, 2020 09:35
@tamirms tamirms marked this pull request as ready for review December 16, 2020 18:54
// Context is the (optional) context which controls the lifetime of a CaptiveStellarCore instance. Once the context is done
// the CaptiveStellarCore instance will not be able to stream ledgers from Stellar Core or spawn new
// instances of Stellar Core. If Context is omitted CaptiveStellarCore will default to using context.Background.
Context context.Context
Contributor

same here

Contributor Author

We could consider moving the context so that it's a parameter to PrepareRange() instead, but I'd prefer implementing that change in a follow-up PR if necessary.

Contributor

Sure, please do.

if err != nil {
	return errors.Wrap(err, "error closing existing session")
}

Contributor

NIIIIIICE!

if len(c.ledgerBuffer.getChannel()) > 0 {
	break
}
time.Sleep(c.waitIntervalPrepareRange)
Contributor

Much cleaner without this. Great!

@2opremio
Contributor

I think this is a great step forward. The code is way cleaner and (IMO) easier to understand.

I am wary of bugs being introduced due to the changes being considerable ... but that's what a beta is for.

I think we should try to do some release testing specifically for this PR.

@bartekn
Contributor

bartekn commented Dec 17, 2020

I think we should try to do some release testing specifically for this PR.

Agree, the plan is to test the last commit when all the remaining PRs are merged (and then master is merged into the release branch). Hopefully this will happen today.

@tamirms tamirms merged commit 06a1b84 into stellar:master Dec 17, 2020
@tamirms tamirms deleted the fix-captive-core-cleanup branch December 17, 2020 17:19
bartekn pushed a commit to bartekn/go that referenced this pull request Jan 6, 2021
…f captive core (stellar#3278)

bartekn added a commit that referenced this pull request Oct 21, 2021
…evious instance termination (#4020)

Add code to `CaptiveCoreBackend.startPreparingRange` that ensures that the
previously started Stellar-Core instance is not running: check whether
`getProcessExitError` returns `true`, which means the Stellar-Core process has
fully terminated. This prevents a situation in which a new instance is started
and clashes with the previous one.
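
A sketch of the guard being described (the `(bool, error)` shape of `getProcessExitError` is an assumption based on this commit message; stub types throughout):

package main

import "fmt"

// coreRunner is a stub for stellarCoreRunner.
type coreRunner struct {
	exited  bool  // only true once cmd.Wait() has returned
	exitErr error // what cmd.Wait() returned
}

func (r *coreRunner) getProcessExitError() (bool, error) { return r.exited, r.exitErr }

type captiveBackend struct{ runner *coreRunner }

// startPreparingRange-style guard: do not spawn a new Stellar-Core instance
// until the previous one is fully terminated, not merely asked to shut down.
func (c *captiveBackend) canStartNewInstance() bool {
	if c.runner == nil {
		return true
	}
	exited, _ := c.runner.getProcessExitError()
	return exited
}

func main() {
	b := &captiveBackend{runner: &coreRunner{exited: false}}
	fmt.Println(b.canStartNewInstance()) // false: old process still shutting down
	b.runner.exited = true
	fmt.Println(b.canStartNewInstance()) // true: safe to start a new instance
}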

The existing code contains a bug, likely introduced in 0f2d08b. The context
returned by `stellarCoreRunner.context()` is cancelled in
`stellarCoreRunner.close()`, which initiates the termination process. At the
same time, `CaptiveCoreBackend.PrepareRange()` internally calls
`CaptiveCoreBackend.isClosed()`, whose return value depends on the
`stellarCoreRunner` context being cancelled. This is wrong because it's
possible that Stellar-Core is still not closed even when the aforementioned
context is cancelled: it can still be shutting down, so the process can still
be running.

Because of this the following chain of events can lead to two Stellar-Core
instances running (briefly) at the same time:

1. A Stellar-Core instance is upgraded, triggering `fileWatcher` to call
   `stellarCoreRunner.close()`, which cancels the `stellarCoreRunner.context()`.
2. In another goroutine, `CaptiveBackend.IsPrepared()` is called, which returns
   `false` because `stellarCoreRunner.context()` is cancelled, and then calls
   `CaptiveBackend.PrepareRange()` to restart Stellar-Core. `PrepareRange()`
   also checks if `stellarCoreRunner.context()` is cancelled (it is, but the
   Stellar-Core process can still be running its shutdown procedure) and then
   attempts to start a new instance.

This commit is really a quick fix. The code before 0f2d08b was simpler because
it was calling `Kill()` on a process, so "terminating" and "terminated" were
exactly the same state. After 0f2d08b there are now two events associated with
a Stellar-Core process (as above). Because of this the code requires a larger
refactoring. We may reconsider using the `tomb` package I tried in #3258,
which was later closed in favour of #3278.