exp/ingest/pipeline: Fix pipeline data race during shutdown #2058

Merged


@bartekn bartekn commented Dec 13, 2019

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md files, etc.) affected by this
    change. Take a look in the docs folder for a given service.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

This commit fixes a data race in exp/ingest/pipeline that can occur when LiveSession (and Horizon) is shut down.

It also removes the updateStats method, which was known to have a data race (see the comment in that method). It is not actively used right now, but the race detector was reporting it.

Fix #2046.

Why

The previous code handling the shutdown signal in LiveSession is shown below:

errChan := s.LedgerPipeline.Process(ledgerReader)
select {
case err2 := <-errChan:
	if err2 != nil {
		// Return with no errors if pipeline shutdown
		if err2 == pipeline.ErrShutdown {
			s.LedgerReporter.OnEndLedger(nil, true)
			return nil
		}
		if s.LedgerReporter != nil {
			s.LedgerReporter.OnEndLedger(err2, false)
		}
		return errors.Wrap(err2, "Ledger pipeline errored")
	}
case <-s.standardSession.shutdown:
	if s.LedgerReporter != nil {
		s.LedgerReporter.OnEndLedger(nil, true)
	}
	s.LedgerPipeline.Shutdown()
	return nil
}

The problem is that when the shutdown signal is received, Resume returns nil, so Horizon starts its shutdown code, which calls Rollback() (using the internal tx object). At the same time, the pipeline keeps running until the code receiving from the ctx.Done channel is executed. This means pipeline processors can still execute transactions using the tx transaction object in the DB session while Horizon is rolling it back. See #2046 for examples.
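
The window described above can be reproduced in a small, self-contained Go sketch. Everything in it is hypothetical (slowPipeline, oldResume, the release channel used to hold the worker open); it only imitates the shape of the old select and is not the actual exp/ingest code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// slowPipeline is a hypothetical stand-in for the real pipeline. Its worker
// keeps running after Shutdown is requested until release is closed.
type slowPipeline struct {
	running int32
	stop    chan struct{}
	release chan struct{} // lets the demo decide when the worker really exits
	done    chan struct{}
}

func newSlowPipeline() *slowPipeline {
	return &slowPipeline{
		stop:    make(chan struct{}),
		release: make(chan struct{}),
		done:    make(chan struct{}),
	}
}

// Process marks the pipeline as running and starts the worker goroutine.
func (p *slowPipeline) Process() <-chan error {
	errChan := make(chan error, 1)
	atomic.StoreInt32(&p.running, 1)
	go func() {
		defer close(p.done)
		<-p.stop    // shutdown was requested
		<-p.release // worker is still busy, e.g. a processor using the DB tx
		atomic.StoreInt32(&p.running, 0)
		errChan <- nil
	}()
	return errChan
}

func (p *slowPipeline) Shutdown()       { close(p.stop) }
func (p *slowPipeline) IsRunning() bool { return atomic.LoadInt32(&p.running) == 1 }

// oldResume imitates the previous select: it returns as soon as the session
// shutdown signal arrives, without waiting for the pipeline to stop.
func oldResume(p *slowPipeline, sessionShutdown <-chan struct{}) error {
	errChan := p.Process()
	select {
	case err := <-errChan:
		return err
	case <-sessionShutdown:
		p.Shutdown()
		return nil
	}
}

// runningAfterOldResume reports whether the pipeline is still running at the
// moment oldResume has returned, which is when the caller would roll back.
func runningAfterOldResume() bool {
	p := newSlowPipeline()
	sessionShutdown := make(chan struct{})
	close(sessionShutdown) // the shutdown signal has been sent
	_ = oldResume(p, sessionShutdown)
	stillRunning := p.IsRunning() // the caller would call Rollback() here
	close(p.release)              // let the worker finish so nothing leaks
	<-p.done
	return stillRunning
}

func main() {
	fmt.Println("pipeline still running after Resume returned:", runningAfterOldResume())
}
```

In this sketch the worker is held open deliberately, so the overlap always shows up; in the real code it depends on timing, which is why the race only appeared intermittently.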

To fix this:

  1. We no longer select on the ingest session shutdown signal while waiting for the pipeline to finish processing.
  2. Instead, we call Shutdown on the pipelines inside LiveSession.Shutdown.
  3. Then we block until the pipelines shut down gracefully, by checking the Pipeline.IsRunning method.
  4. Finally, we close(s.shutdown) inside expingest/System.Shutdown().

So the components now shut down in exactly the following order:

  1. Pipelines.
  2. Session.
  3. Horizon Expingest System.
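
The fix steps and the resulting order can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the merged code: fakePipeline, fakeSession, fakeSystem, eventLog and the 1 ms polling interval are all made up for the example, and only the names Shutdown, IsRunning, Process and Resume echo the PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// errShutdown plays the role of pipeline.ErrShutdown in this sketch.
var errShutdown = errors.New("pipeline shutdown")

// eventLog records the order in which components stop.
type eventLog struct {
	mu     sync.Mutex
	events []string
}

func (l *eventLog) add(e string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.events = append(l.events, e)
}

// fakePipeline is a hypothetical stand-in for the real pipeline type.
type fakePipeline struct {
	running int32
	stop    chan struct{}
}

// Process starts a worker that runs until Shutdown is called.
func (p *fakePipeline) Process() <-chan error {
	errChan := make(chan error, 1)
	atomic.StoreInt32(&p.running, 1)
	go func() {
		<-p.stop // simulated work
		atomic.StoreInt32(&p.running, 0)
		errChan <- errShutdown
	}()
	return errChan
}

func (p *fakePipeline) Shutdown()       { close(p.stop) }
func (p *fakePipeline) IsRunning() bool { return atomic.LoadInt32(&p.running) == 1 }

type fakeSession struct {
	pipeline *fakePipeline
	log      *eventLog
}

// Resume waits only for the pipeline result (step 1: no select on shutdown).
func (s *fakeSession) Resume() error {
	if err := <-s.pipeline.Process(); err != errShutdown {
		return err
	}
	return nil
}

// Shutdown stops the pipeline (step 2) and blocks until it has stopped (step 3).
func (s *fakeSession) Shutdown() {
	s.pipeline.Shutdown()
	for s.pipeline.IsRunning() {
		time.Sleep(time.Millisecond) // assumed polling interval
	}
	s.log.add("pipelines")
	s.log.add("session")
}

type fakeSystem struct {
	session  *fakeSession
	shutdown chan struct{}
	log      *eventLog
}

// Shutdown closes the system's own channel only after the session is done (step 4).
func (sys *fakeSystem) Shutdown() {
	sys.session.Shutdown()
	close(sys.shutdown)
	sys.log.add("system")
}

// shutdownOrder runs one session, shuts the system down, and returns the order.
func shutdownOrder() []string {
	log := &eventLog{}
	p := &fakePipeline{stop: make(chan struct{})}
	sess := &fakeSession{pipeline: p, log: log}
	sys := &fakeSystem{session: sess, shutdown: make(chan struct{}), log: log}

	done := make(chan error, 1)
	go func() { done <- sess.Resume() }()
	for !p.IsRunning() { // wait until the pipeline has started
		time.Sleep(time.Millisecond)
	}
	sys.Shutdown()
	<-done
	return log.events
}

func main() {
	fmt.Println(shutdownOrder())
}
```

Because Resume no longer returns on the shutdown signal itself, the caller cannot reach its rollback step while a pipeline worker is still active; the cost is that Shutdown now blocks for as long as the pipeline takes to stop.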

One comment on the -1 change in tests: when ingestSession.Run() returns nil, we shouldn't continue to ingestSession.Resume(), because a nil value means that the session ended. I updated the comment in LiveSession and also fixed the Horizon code.

Known limitations

The pipeline design is very powerful, but it also makes it very easy to introduce data races like this one. We may want to refactor it, as noted previously in #2050.


@abuiles abuiles left a comment


LGTM!

@@ -107,6 +110,14 @@ func (s *LiveSession) updateCursor(ledgerSequence uint32) error {
func (s *LiveSession) Resume(ledgerSequence uint32) error {
	s.standardSession.shutdown = make(chan bool)

	err := s.validate()

@bartekn did we forget to add this before?

Contributor Author


Yes, there are two entry points to LiveSession: Run and Resume.

@bartekn bartekn merged commit 837b12c into stellar:release-horizon-v0.24.1 Dec 16, 2019
@bartekn bartekn deleted the fix-pipeline-data-race branch December 16, 2019 12:08