Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest/ledgerbackend: Make sure Stellar-Core is not started before previous instance termination #4020

Merged

Conversation

bartekn
Copy link
Contributor

@bartekn bartekn commented Oct 20, 2021

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Add a code to CaptiveCoreBackend.startPreparingRange that ensures that previously started Stellar-Core instance: check if getProcessExitError returns true which means that Stellar-Core process is fully terminated. This prevents a situation in which a new instance is started and clashes with the previous one.

Why

The existing code contains a bug, likely introduced in 0f2d08b. The context returned by stellarCoreRunner.context() is cancelled in stellarCoreRunner.close() which initiates the termination process. At the same time CaptiveCoreBackend.PrepareRange() calls CaptiveCoreBackend.isClosed() internally that checks which return value depends on stellarCoreRunner context being closed. This is wrong because it's possible that Stellar-Core is still not closed even when aforementioned context is cancelled - it can be still closing so the process can be still running.

Because of this the following chain of events can lead to two Stellar-Core instances running (briefly) at the same time:

  1. Stellar-Core instance is upgraded, triggering fileWatcher to call stellarCoreRunner.close() which cancels the stellarCoreRunner.context().
  2. In another Go routine, CaptiveBackend.IsPrepared() is called, which returns false because stellarCoreRunner.context() is canceled and then calls CaptiveBackend.PrepareRange() to restart Stellar-Core. PrepareRange() also checks if stellarCoreRunner.context() is cancelled (it is but Stellar-Core process can still run shutdown procedure) and then attempts to start a new instance.

Known limitations

This commit is really a quick fix. Code before 0f2d08b was simpler because it was calling Kill() on a process so "terminating" and "terminated" were exactly the same state. After 0f2d08b there are now two events associated with a Stellar-Core process (as above). Because of this the code requires a larger refactoring. We may reconsider using tomb package I tried in #3258 that was later closed in favour of: #3278.

@bartekn bartekn marked this pull request as ready for review October 20, 2021 22:04
@bartekn bartekn requested a review from a team October 21, 2021 10:34
@2opremio
Copy link
Contributor

Can you retarget to the release branch?

@bartekn bartekn changed the base branch from master to release-horizon-v2.10.0 October 21, 2021 12:30
@bartekn bartekn merged commit 330fb5f into stellar:release-horizon-v2.10.0 Oct 21, 2021
@bartekn bartekn deleted the captive-core-wait-for-close branch October 21, 2021 12:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants