ingest/ledgerbackend: Captive-Core fixes to support Stellar-Core 17.1.0 #3694
if runtime.GOOS == "windows" {
	// It's impossible to send SIGINT on Windows so buckets can become
	// corrupted. If we can't reuse it, then remove it.
	return os.RemoveAll(storagePath)
If the buckets are corrupted, what happens if we don't remove the directory? Will captive core be unable to start at all? Should we also remove the directory on Linux when captive core does not shut down gracefully and we have to use SIGKILL?
Good point. I believe the change in 3d28e9e should fix it (remove folder if there was an error terminating the process).
I don't see how r.processExitError != nil implies that the process must have been terminated by SIGKILL. Isn't the scenario below possible?
- context is cancelled
- we send SIGINT to captive core
- captive core terminates cleanly before the 10 second timeout (we don't need to send SIGKILL)
- r.processExitError is assigned the context error, which is non-nil
I forgot about context.Canceled - does 3200baf look good now?
@bartekn I ran it locally and it seems to work. It might be worth adding an assertion in the integration tests here:
If we're running the integration tests on Windows with captive core, then we expect the buckets directory to still exist after Horizon has shut down.
Isn't that an issue if it gets OOM-killed or powered off too?
Correct, but I believe there's nothing we can do other than, maybe, creating a lock file as Stellar-Core does. Even that will not help with data corruption. In such a case the storage dir needs to be removed manually.
What
Multiple fixes to support Captive-Core persistent storage in 17.1.0:
- Switched from exec.CommandContext to exec.Command when starting Stellar-Core. This was done because CommandContext kills the process (Process.Kill()) without waiting for a graceful shutdown, which can corrupt the minimal DB or buckets. We now send os.Interrupt so Stellar-Core can exit gracefully. There's one gotcha: os.Interrupt is not supported on Windows, so there we kill the process and empty the storage folder, which can contain corrupted data.
- Removed the DISABLE_XDR_FSYNC setting in stellar-core.cfg. This setting can corrupt buckets when Stellar-Core is killed.
- Changed when CaptiveStellarCore.nextLedger is set. Previously it was set to a value calculated using CaptiveStellarCore.runFromParams, but these calculations are no longer correct when Stellar-Core is restarted with persistent storage. To prevent some Stellar-Core version checks we now set CaptiveStellarCore.nextLedger after the first ledger is streamed.
Why
Stellar-Core 17.1.0 now persists the minimal DB and buckets between executions. It allows faster catchup on restart.
Known limitations
[TODO or N/A]