ingest/ledgerbackend: Captive-Core fixes to support Stellar-Core 17.1.0 #3694

bartekn · 2021-06-16T11:34:52Z

PR Checklist

PR Structure

This PR has reasonably narrow scope (if not, break it down into smaller PRs).
This PR avoids mixing refactoring changes with feature changes (split into two PRs
otherwise).
This PR's title starts with name of package that is most changed in the PR, ex.
services/friendbot, or all or doc if the changes are broad or impact many
packages.

Thoroughness

This PR adds tests for the most critical parts of the new functionality or fixes.
I've updated any docs (developer docs, .md
files, etc... affected by this change). Take a look in the docs folder for a given service,
like this one.

Release planning

I've updated the relevant CHANGELOG (here for Horizon) if
needed with deprecations, added features, breaking changes, and DB schema changes.
I've decided if this PR requires a new major/minor version according to
semver, or if it's mainly a patch change. The PR is targeted at the next
release branch if it's not a patch change.

What

Multiple fixes to support Captive-Core persistent storage in 17.1.0:

Switched from exec.CommandContext to exec.Command when starting Stellar-Core. CommandContext kills the process (Process.Kill()) when the context is cancelled, without waiting for a graceful shutdown, which can corrupt the minimal DB or buckets. We now send os.Interrupt so Stellar-Core can exit gracefully. One gotcha: sending os.Interrupt is not supported on Windows, so there we kill the process and empty the storage folder, which may contain corrupted data.
Removed the DISABLE_XDR_FSYNC setting from stellar-core.cfg. With this setting, buckets can be corrupted when Stellar-Core is killed.
Changed how CaptiveStellarCore.nextLedger is set. Previously it was set to a value calculated from CaptiveStellarCore.runFromParams, but these calculations are no longer correct when Stellar-Core is restarted with persistent storage. To prevent some Stellar-Core version checks, we now set CaptiveStellarCore.nextLedger after the first ledger is streamed.
Reverted the code creating and removing storage directory on Windows.

Why

Stellar-Core 17.1.0 now persists the minimal DB and buckets between executions. It allows faster catchup on restart.

Known limitations

[TODO or N/A]

ingest/ledgerbackend/stellar_core_runner.go

tamirms · 2021-06-16T12:25:45Z


+	if runtime.GOOS == "windows" {
+		// It's impossible to send SIGINT on Windows so buckets can become
+		// corrupted. If we can't reuse it, then remove it.
+		return os.RemoveAll(storagePath)


If the buckets are corrupted, what happens if we don't remove the directory? Will captive core be unable to start at all? Should we also remove the directory on Linux in the scenario where captive core does not shut down gracefully and we have to use SIGKILL?

Good point. I believe the change in 3d28e9e should fix it (remove folder if there was an error terminating the process).

I don't see how r.processExitError != nil implies that the process must have been terminated by sigkill. Isn't the scenario below possible?

context is cancelled

we send sigint to captive core

captive core terminates cleanly before the 10 second timeout (we don't need to send sigkill)

r.processExitError is assigned the context error which is non nil

I forgot about context.Canceled - does 3200baf look good now?

@bartekn I ran it locally and it seems to work. it might be worth adding an assertion in the integration tests here:

https://github.com/stellar/go/blob/master/services/horizon/internal/test/integration/integration.go#L252

if we're running the integration tests on windows with captive core then we expect the buckets directory to still exist after horizon has shut down

ingest/ledgerbackend/captive_core_backend.go

paulbellamy · 2021-06-17T14:09:14Z

... without waiting for a graceful shutdown. This can corrupt ...

Isn't that an issue if it gets OOM-killed, or powered-off too?

bartekn · 2021-06-17T14:14:36Z

Isn't that an issue if it gets OOM-killed, or powered-off too?

Correct, but I believe there's nothing we can do other than, maybe, creating a lock file as Stellar-Core does. But it will still not help with data corruption. In such a case the storage dir needs to be removed manually.

ingest/ledgerbackend: Captive-Core fixes to support Stellar-Core 17.1.0

93857c9

bartekn requested a review from a team June 16, 2021 11:34

bartekn mentioned this pull request Jun 16, 2021

Ledger: processFeesSeqNums error @ 1 : Unexpected database state (Version: 17.1.0) stellar/stellar-core#3085

Closed

bartekn added 2 commits June 16, 2021 14:03

Use waitOrStop pattern

666b1e5

fix

ca4365d