ingest/ledgerbackend: Use context to handle termination and cleanup of captive core #3278
Conversation
Force-pushed from 9f68992 to 19d5cd1.
Thanks for this proposal. Here's a list of what I like and don't like. I also added some inline comments. Please note that I haven't run the code so I may be wrong below.
Like:
- `bufferedMetaPipeReader` used in `stellarCoreRunner` directly. I didn't do that in the PR that introduced `bufferedMetaPipeReader` because I was thinking that we should keep it separate to give access to the raw meta stream because maybe we'll use it in the future. But we can split it again when needed.
Don't like:
- Looks like in order to properly close the backend you need to call `Close()` and `cancel()` the `Context`. This can be confusing because `Context` is not part of the `LedgerBackend` interface so it will be hard to reuse the code with a different backend. It will be a problem even in the simplest apps that want to close the backend on CMD+C.
- I think that two shutdown signals, in general, create all kinds of new concurrency issues. One example: if we cancel the context and call `Close()` it's possible that the process exit error returned by `r.cmd.Wait()` will be `killed` instead of `ErrCanceled` because the `Close()` code executed first (I added an inline comment with an example of when this can happen). We can add a bunch of code to enforce the order we want, but `tomb` has a nice property of `Kill`: it doesn't overwrite the error on subsequent calls to `Kill`, so we'll still know the original error.
- I added an inline comment to `checkMetaPipeResult` but wanted to discuss it here too. This is quite complicated because it has to check 3 different errors (`stellarCoreRunner.context().Err()`, `result.err` and `getProcessExitError()`) and the `ok` value. A developer not only needs to know the internals of `stellarCoreRunner` and `bufferedMetaPipeReader` but also the order in which events execute (context cancellation, channel close, etc.). This was really difficult for me to understand. In `tomb` you receive a single error (in our case it's `nil`, `context.Cancelled` or an actual error) via a single method (so no issues with the potential concurrency problems I explained in the inline comment).
- In general I don't like that we're back to using wait groups and closing channels, which can be easily misused and were misused in the past: Reliability issues in remote-core http server #3228, Crash in captive-core's backend #3159 - and not only in our project. `tomb` can protect us from this pretty well.
One extra note: when I was searching for solutions to this problem I came across https://dave.cheney.net/2017/08/20/context-isnt-for-cancellation which I think is worth checking. I don't fully agree with this statement because I think that context cancellation (by timeout/deadline/cancel) is a great way to handle HTTP request cancellation. But in cases where an ack is needed and the final error can be overwritten along the way, I think context is really hard to maintain.
```go
	return false, xdr.LedgerCloseMeta{}, errOut
}

func (c *CaptiveStellarCore) checkMetaPipeResult(result metaResult, ok bool) error {
```
I think it's possible here that in case of context cancellation: `c.stellarCoreRunner.context().Err()` will be nil (it is only cancelled in `close()`, which may not have executed yet) but `c.stellarCoreRunner.getProcessExitError()` will report that the process exited with an error (likely `killed`).
Yes, that is a valid case. What's wrong with that scenario?
You don't want `GetLedger` or `PrepareRange` to return errors in case of cancellation (except an error that specifies this was cancelled). A `killed` error can confuse users into thinking that something went wrong, and I guess we'll see it every once in a while when shutting down Horizon (we do handle `context.Cancelled` though and don't log it at the `error` level).
If `Close()` is called on a `CaptiveStellarCore` which is running `GetLedger()` or `PrepareRange()`, the only possible error `GetLedger()` or `PrepareRange()` can return is `context.Cancelled`. There is no scenario where a `killed` error will be returned.
```go
// drain meta pipe channel to make sure the ledger buffer goroutine exits
for range r.getMetaPipe() {
}
```
We shouldn't do that. I actually removed this from the existing code in my PR because I realized that if `GetLedger`/`PrepareRange` is still running it may return ledgers out of order (because some are read in here). That's why this is useful.
I don't see how this is an issue. If we are closing the captive core instance we don't care about reading the ledgers and we want to terminate as soon as possible.
Note that this is done in two different routines. So if `GetLedger` reads ledgers out of order it will error.
> Note that this is done in two different routines. So if GetLedger reads ledgers out of order it will error.
This will only occur if we're `Close()`-ing captive core, and in that scenario `GetLedger()` doesn't care about the ledgers and needs to exit ASAP.
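The drain pattern being debated above can be sketched as follows. This is an illustrative toy (not the PR's actual code): a producer goroutine may be blocked sending into a full buffered channel, and draining the channel on shutdown guarantees the producer reaches its "done" case and exits, at the cost of discarding any buffered values.

```go
package main

import (
	"fmt"
	"sync"
)

// runAndDrain models shutting down a ledger-buffer-style producer:
// close(done) signals shutdown, then the consumer drains the channel so a
// send blocked on a full buffer completes and the producer can exit.
func runAndDrain() string {
	metaPipe := make(chan int, 2)
	done := make(chan struct{})
	var wg sync.WaitGroup

	wg.Add(1)
	go func() { // producer, e.g. the ledger buffer goroutine
		defer wg.Done()
		defer close(metaPipe) // lets the drain loop below terminate
		for i := 0; ; i++ {
			select {
			case metaPipe <- i:
			case <-done:
				return
			}
		}
	}()

	close(done) // begin shutdown
	// Drain: values read here are intentionally discarded, since on close
	// the caller no longer cares about ledger contents or ordering.
	for range metaPipe {
	}
	wg.Wait()
	return "producer exited cleanly"
}

func main() {
	fmt.Println(runAndDrain())
}
```

This also illustrates both sides of the disagreement: draining guarantees the producer exits, but any goroutine still consuming real values concurrently would observe gaps, which is the out-of-order concern raised above.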
Force-pushed from 08fb50e to 9596859.
This isn't quite right. If you call […]

I think you misunderstood the semantics of […]

I think the comment about the complexity is a fair point. However, I think it's acceptable because this complexity is completely encapsulated within 1 function.

I believe the issues we had in the past could be attributed to the difficulty of working with concurrency in general more so than the specific use of waitgroups or channels. In your PR where you introduced tomb we were still able to find a bunch of concurrency-related bugs.

OK, I understand now. I was also thinking about it after your comment in #3258 and I think what we could do is to make […]

OK, I can see it now after you explained what […]

Again, it's clear after […]

Agreed! I meant: in my opinion it's much easier to make mistakes like closing a channel twice when operating on channels directly compared to when using […]

@bartekn I have updated the PR to make the context in […]
Force-pushed from 4591ffd to 45c199c.

Force-pushed from 45c199c to 79df8a6.

Force-pushed from 08005ef to 6a61fad.
```go
// Context is the (optional) context which controls the lifetime of a CaptiveStellarCore instance. Once the context is done
// the CaptiveStellarCore instance will not be able to stream ledgers from Stellar Core or spawn new
// instances of Stellar Core. If Context is omitted CaptiveStellarCore will default to using context.Background.
Context context.Context
```
same here
We could consider moving the context so that it's a parameter to `PrepareRange()` instead, but I'd prefer implementing that change in a follow-up PR if necessary.
Sure, please do.
```go
if err != nil {
	return errors.Wrap(err, "error closing existing session")
}
```
NIIIIIICE!
```go
if len(c.ledgerBuffer.getChannel()) > 0 {
	break
}
time.Sleep(c.waitIntervalPrepareRange)
```
Much cleaner without this. Great!
I think this is a great step forward. The code is way cleaner and (IMO) easier to understand. I am wary of bugs being introduced due to the changes being considerable... but that's what a beta is for. I think we should try to do some release testing specifically for this PR.
Agree, the plan is to test the last commit when all the remaining PRs are merged (and then master is merged into the release branch). Hopefully this will happen today.
…f captive core (stellar#3278)

CaptiveStellarCore exposes the following functions:
* PrepareRange()
* IsPrepared()
* GetLedger()
* GetLatestLedgerSequence()
* Close()

PrepareRange(), IsPrepared(), GetLedger(), and GetLatestLedgerSequence() cannot be called concurrently. Only one go routine should be calling those functions on a CaptiveStellarCore instance. If any of those functions are accessed concurrently then the behavior of CaptiveStellarCore will be undefined.

However, Close() can be called concurrently with any of the functions listed above. In other words, it is safe for Close() to interrupt any pending PrepareRange(), IsPrepared(), GetLedger(), or GetLatestLedgerSequence() operation. Once Close() is called, any pending operation (e.g. PrepareRange() / GetLedger()) is expected to terminate quickly. The Close() operation should also complete in a timely manner.

After Close() is called, the captive core subprocess should be reaped, the captive core named pipe should be closed, the temporary directory used by captive core should be deleted, and all go routines related to buffering captive core logs or streaming ledgers from the named pipe should be terminated. The behavior of Close() is now terminal. Once it is called the CaptiveStellarCore instance will fail on all subsequent operations.

The CaptiveCoreConfig struct has been extended to include an optional Context field. The context in CaptiveCoreConfig acts as a parent context for all PrepareRange() sessions. In other words, if the parent context is cancelled then any pending PrepareRange() / GetLedger() operation will be terminated even though Close() was not called. Furthermore, once the parent context is terminated the CaptiveStellarCore instance will effectively be dead (e.g. it will not be possible to start any more PrepareRange() sessions).
…evious instance termination (#4020)

Add code to `CaptiveCoreBackend.startPreparingRange` that ensures that the previously started Stellar-Core instance is not running: check if `getProcessExitError` returns `true`, which means that the Stellar-Core process is fully terminated. This prevents a situation in which a new instance is started and clashes with the previous one.

The existing code contains a bug, likely introduced in 0f2d08b. The context returned by `stellarCoreRunner.context()` is cancelled in `stellarCoreRunner.close()`, which initiates the termination process. At the same time, `CaptiveCoreBackend.PrepareRange()` internally calls `CaptiveCoreBackend.isClosed()`, whose return value depends on the `stellarCoreRunner` context being cancelled. This is wrong because it's possible that Stellar-Core is still not closed even when the aforementioned context is cancelled - it can still be closing, so the process can still be running. Because of this the following chain of events can lead to two Stellar-Core instances running (briefly) at the same time:

1. A Stellar-Core instance is upgraded, triggering `fileWatcher` to call `stellarCoreRunner.close()` which cancels the `stellarCoreRunner.context()`.
2. In another Go routine, `CaptiveBackend.IsPrepared()` is called, which returns `false` because `stellarCoreRunner.context()` is cancelled, and then calls `CaptiveBackend.PrepareRange()` to restart Stellar-Core. `PrepareRange()` also checks if `stellarCoreRunner.context()` is cancelled (it is, but the Stellar-Core process can still be running its shutdown procedure) and then attempts to start a new instance.

This commit is really a quick fix. Code before 0f2d08b was simpler because it was calling `Kill()` on the process, so "terminating" and "terminated" were exactly the same state. After 0f2d08b there are now two events associated with a Stellar-Core process (as above). Because of this the code requires a larger refactoring.
We may reconsider using the `tomb` package I tried in #3258, which was later closed in favour of #3278.
What

This PR implements robust shutdown behavior in `CaptiveStellarCore` by embedding a context in `stellarCoreRunner`.

`CaptiveStellarCore` exposes the following functions:

* `PrepareRange()`
* `IsPrepared()`
* `GetLedger()`
* `GetLatestLedgerSequence()`
* `Close()`
`PrepareRange()`, `IsPrepared()`, `GetLedger()`, and `GetLatestLedgerSequence()` cannot be called concurrently. Only one go routine should be calling those functions on a `CaptiveStellarCore` instance. If any of those functions are accessed concurrently then the behavior of `CaptiveStellarCore` will be undefined.

However, `Close()` can be called concurrently with any of the functions listed above. In other words, it is safe for `Close()` to interrupt any pending `PrepareRange()`, `IsPrepared()`, `GetLedger()`, or `GetLatestLedgerSequence()` operation. Once `Close()` is called, any pending operation (e.g. `PrepareRange()` / `GetLedger()`) is expected to terminate quickly. The `Close()` operation should also complete in a timely manner.

After `Close()` is called, the captive core subprocess should be reaped, the captive core named pipe should be closed, the temporary directory used by captive core should be deleted, and all go routines related to buffering captive core logs or streaming ledgers from the named pipe should be terminated.

~~Note that you can still use a `CaptiveStellarCore` instance after calling `Close()`. If you call `PrepareRange()` after calling `Close()` then `CaptiveStellarCore` will create a new captive core subprocess (along with the named pipe, buffering go routines, etc) and begin a new session.~~ [Edit: the behavior of `Close()` is now terminal. Once it is called the `CaptiveStellarCore` instance will fail on all subsequent operations.]
The `CaptiveCoreConfig` struct has been extended to include an optional `Context` field. The context in `CaptiveCoreConfig` acts as a parent context for all `PrepareRange()` sessions. In other words, if the parent context is cancelled then any pending `PrepareRange()` / `GetLedger()` operation will be terminated even though `Close()` was not called. Furthermore, once the parent context is terminated the `CaptiveStellarCore` instance will effectively be dead (e.g. it will not be possible to start any more `PrepareRange()` sessions).
The behavior described above is implemented by embedding a context in the `stellarCoreRunner` struct. `stellarCoreRunner` will pass that context into `exec.CommandContext()` when spawning the captive core process. When we want to abort the running `stellarCoreRunner` instance we cancel the context associated with `stellarCoreRunner`, which will kill the captive core subprocess (by calling `os.Process.Kill`) if stellar core has not already exited on its own (see https://golang.org/pkg/os/exec/#CommandContext). By using `exec.CommandContext()` we are able to remove some of the cleanup code from `stellarCoreRunner`, including the unix and windows implementations of `processIsAlive()`.
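The `exec.CommandContext` behavior this relies on can be demonstrated in isolation. In this sketch `sleep` is just a stand-in for the captive core binary, and `runAndCancel` is an invented helper, not the PR's code:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runAndCancel starts a long-running child process under a cancellable
// context, then cancels it: exec.CommandContext kills the process if it
// hasn't exited on its own, so Wait returns promptly with an error
// (typically "signal: killed") and no processIsAlive() polling is needed.
func runAndCancel() error {
	ctx, cancel := context.WithCancel(context.Background())
	cmd := exec.CommandContext(ctx, "sleep", "60")
	if err := cmd.Start(); err != nil {
		return err
	}
	// Simulate aborting the runner shortly after the process starts.
	go func() {
		time.Sleep(100 * time.Millisecond)
		cancel()
	}()
	return cmd.Wait()
}

func main() {
	fmt.Println(runAndCancel()) // a non-nil error: the process was killed
}
```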
Previously, I mentioned that `CaptiveStellarCore.Close()` is able to safely interrupt any pending `CaptiveStellarCore` operation. This invariant is guaranteed because `CaptiveStellarCore` uses a read write lock to protect access to `stellarCoreRunner`. While the read lock is held, `CaptiveStellarCore` cannot assign its `stellarCoreRunner` field to a new value. `stellarCoreRunner.close()` is also thread safe and idempotent (so it doesn't matter if `CaptiveStellarCore.Close()` is called multiple times).
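The read-write-lock invariant can be sketched as follows. The names here (`captiveCore`, `withRunner`, `replaceRunner`) are hypothetical, not the PR's actual identifiers; the point is only that readers see a stable runner while the write lock guards reassignment:

```go
package main

import (
	"fmt"
	"sync"
)

type runner struct{ id int }

// captiveCore models the RWMutex-protected runner field described above.
type captiveCore struct {
	mu     sync.RWMutex
	runner *runner
}

// withRunner runs fn against a stable view of the runner field: while the
// read lock is held, the field cannot be reassigned.
func (c *captiveCore) withRunner(fn func(*runner)) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	fn(c.runner)
}

// replaceRunner installs a new runner under the write lock; it blocks
// until no reader holds the lock, so readers never see a mid-swap value.
func (c *captiveCore) replaceRunner(r *runner) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.runner = r
}

func main() {
	c := &captiveCore{runner: &runner{id: 1}}
	c.withRunner(func(r *runner) { fmt.Println("using runner", r.id) })
	c.replaceRunner(&runner{id: 2})
	c.withRunner(func(r *runner) { fmt.Println("using runner", r.id) })
}
```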
Once `stellarCoreRunner.close()` is invoked it will cancel the `stellarCoreRunner` context and it will also close the channel used by `CaptiveStellarCore` to read ledgers from captive core. At this point any blocking calls to `GetLedger()` will be able to terminate quickly.

The code which determines how `GetLedger()` terminates on error is encapsulated in the `checkMetaPipeResult()` function. There are basically 3 types of errors which can arise in `GetLedger()`. All these cases are handled in `checkMetaPipeResult()`.
Some other points to note while reading the PR:
* `bufferedLedgerMetaReader` is now more decoupled from `CaptiveStellarCore` and `stellarCoreRunner`.
* `bufferedLedgerMetaReader` is now created and owned by `stellarCoreRunner`. When `stellarCoreRunner` terminates, it will dispose of its `bufferedLedgerMetaReader` instance.
* `CaptiveStellarCore` does not have to listen to any exit channels. `CaptiveStellarCore` only needs to handle the case when the `stellarCoreRunner` meta pipe is exhausted.
* The go routine used by `stellarCoreRunner` to consume stdout and stderr from captive core was never terminated. This PR fixes the go routine leak in `stellarCoreRunner`.
Known limitations
I need to fix the tests. I plan to do so once I receive approval on the overall approach of the PR.