Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest/ledgerbackend: Captive-Core fixes to support Stellar-Core 17.1.0 #3694

Merged

Conversation

bartekn
Copy link
Contributor

@bartekn bartekn commented Jun 16, 2021

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Multiple fixes to support Captive-Core persistent storage in 17.1.0:

  • Switched from exec.CommandContext to exec.Command when starting Stellar-Core. This was done because CommandContext kills the process (Process.Kill()) without waiting for a graceful shutdown. This can corrupt minimal DB or buckets. We send signal.Interrupt right now so Stellar-Core can exit gracefully. There's one gotcha: signal.Interrupt is not supported in Windows so we kill the process and empty storage folder that can contain corrupted data.
  • Removed DISABLE_XDR_FSYNC setting in stellar-core.cfg. This can corrupt buckets when Stellar-Core is killed.
  • Changed the behaviour of CaptiveStellarCore.nextLedger. Previously it was set to a value calculated using CaptiveStellarCore.runFromParams however these calculations are no longer correct when Stellar-Core is restarted with persistent storage. To prevent some Stellar-Core version checks we now set CaptiveStellarCore.nextLedger after the first ledger is streamed.
  • Reverted the code creating and removing storage directory on Windows.

Why

Stellar-Core 17.1.0 now persists the minimal DB and buckets between executions. It allows faster catchup on restart.

Known limitations

[TODO or N/A]

if runtime.GOOS == "windows" {
// It's impossible to send SIGINT on Windows so buckets can become
// corrupted. If we can't reuse it, then remove it.
return os.RemoveAll(storagePath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the buckets are corrupted what happens if we don't remove the directory? will captive core be unable to start at all? should we also remove the directory on linux in the scenario that captive core does not shutdown gracefully and we have to use sigkill?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I believe the change in 3d28e9e should fix it (remove folder if there was an error terminating the process).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how r.processExitError != nil implies that the process must have been terminated by sigkill. Isn't the scenario below possible?

  1. context is cancelled
  2. we send sigint to captive core
  3. captive core terminates cleanly before the 10 second timeout (we don't need to send sigkill)
  4. r.processExitError is assigned the context error which is non nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot about context.Cancelled - does 3200baf look good now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bartekn I ran it locally and it seems to work. it might be worth adding an assertion in the integration tests here:

https://github.com/stellar/go/blob/master/services/horizon/internal/test/integration/integration.go#L252

if we're running the integration tests on windows with captive core then we expect that the buckets directory to still exist after horizon has shut down

@bartekn bartekn marked this pull request as ready for review June 16, 2021 17:05
@bartekn bartekn added this to the Horizon 2.5.0 milestone Jun 16, 2021
@paulbellamy
Copy link
Contributor

... without waiting for a graceful shutdown. This can corrupt ...

Isn't that an issue if it gets OOM-killed, or powered-off too?

@bartekn
Copy link
Contributor Author

bartekn commented Jun 17, 2021

Isn't that an issue if it gets OOM-killed, or powered-off too?

Correct but I believe there's nothing we can other than, maybe, creating a lock file as Stellar-Core does. But it will still not help with data corruption. In such case storage dir need to be removed manually.

@bartekn bartekn merged commit 0f2d08b into stellar:release-horizon-v2.5.0 Jun 17, 2021
@bartekn bartekn deleted the captive-core-17-1-fixes branch June 17, 2021 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants