ingest/ledgerbackend: Make sure Stellar-Core is not started before previous instance termination #4020

bartekn · 2021-10-20T21:21:37Z

PR Checklist

PR Structure

This PR has reasonably narrow scope (if not, break it down into smaller PRs).
This PR avoids mixing refactoring changes with feature changes (split into two PRs
otherwise).
This PR's title starts with name of package that is most changed in the PR, ex.
services/friendbot, or all or doc if the changes are broad or impact many
packages.

Thoroughness

This PR adds tests for the most critical parts of the new functionality or fixes.
I've updated any docs (developer docs, .md
files, etc... affected by this change). Take a look in the docs folder for a given service,
like this one.

Release planning

I've updated the relevant CHANGELOG (here for Horizon) if
needed with deprecations, added features, breaking changes, and DB schema changes.
I've decided if this PR requires a new major/minor version according to
semver, or if it's mainly a patch change. The PR is targeted at the next
release branch if it's not a patch change.

What

Add a code to CaptiveCoreBackend.startPreparingRange that ensures that previously started Stellar-Core instance: check if getProcessExitError returns true which means that Stellar-Core process is fully terminated. This prevents a situation in which a new instance is started and clashes with the previous one.

Why

The existing code contains a bug, likely introduced in 0f2d08b. The context returned by stellarCoreRunner.context() is cancelled in stellarCoreRunner.close() which initiates the termination process. At the same time CaptiveCoreBackend.PrepareRange() calls CaptiveCoreBackend.isClosed() internally that checks which return value depends on stellarCoreRunner context being closed. This is wrong because it's possible that Stellar-Core is still not closed even when aforementioned context is cancelled - it can be still closing so the process can be still running.

Because of this the following chain of events can lead to two Stellar-Core instances running (briefly) at the same time:

Stellar-Core instance is upgraded, triggering fileWatcher to call stellarCoreRunner.close() which cancels the stellarCoreRunner.context().
In another Go routine, CaptiveBackend.IsPrepared() is called, which returns false because stellarCoreRunner.context() is canceled and then calls CaptiveBackend.PrepareRange() to restart Stellar-Core. PrepareRange() also checks if stellarCoreRunner.context() is cancelled (it is but Stellar-Core process can still run shutdown procedure) and then attempts to start a new instance.

Known limitations

This commit is really a quick fix. Code before 0f2d08b was simpler because it was calling Kill() on a process so "terminating" and "terminated" were exactly the same state. After 0f2d08b there are now two events associated with a Stellar-Core process (as above). Because of this the code requires a larger refactoring. We may reconsider using tomb package I tried in #3258 that was later closed in favour of: #3278.

…evious instance termination

…nto captive-core-wait-for-close

2opremio · 2021-10-21T12:30:01Z

Can you retarget to the release branch?

bartekn added 4 commits October 20, 2021 23:13

ingest/ledgerbackend: Make sure Stellar-Core is not started before pr…

504d0e8

…evious instance termination

Add test

8aa2a52

Add comment

aade63c

Merge branch 'master' into captive-core-wait-for-close

736aced

bartekn marked this pull request as ready for review October 20, 2021 22:04

bartekn added 2 commits October 21, 2021 00:53

Fix

e151758

Merge branch 'captive-core-wait-for-close' of github.com:bartekn/go i…

9ee021a

…nto captive-core-wait-for-close

bartekn requested a review from a team October 21, 2021 10:34

2opremio approved these changes Oct 21, 2021

View reviewed changes

bartekn changed the base branch from master to release-horizon-v2.10.0 October 21, 2021 12:30

bartekn merged commit 330fb5f into stellar:release-horizon-v2.10.0 Oct 21, 2021

bartekn deleted the captive-core-wait-for-close branch October 21, 2021 12:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ingest/ledgerbackend: Make sure Stellar-Core is not started before previous instance termination #4020

ingest/ledgerbackend: Make sure Stellar-Core is not started before previous instance termination #4020

bartekn commented Oct 20, 2021 •

edited

Loading

2opremio commented Oct 21, 2021

ingest/ledgerbackend: Make sure Stellar-Core is not started before previous instance termination #4020

ingest/ledgerbackend: Make sure Stellar-Core is not started before previous instance termination #4020

Conversation

bartekn commented Oct 20, 2021 • edited Loading

PR Structure

Thoroughness

Release planning

What

Why

Known limitations

2opremio commented Oct 21, 2021

bartekn commented Oct 20, 2021 •

edited

Loading