
services/horizon: Removed LedgerBackend calls inside ingest critical section #3518

Merged
merged 18 commits into from
Apr 12, 2021

Conversation

bartekn
Contributor

@bartekn bartekn commented Mar 30, 2021

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the name of the package that is most changed in the
    PR, ex. services/friendbot, or all or doc if the changes are broad or impact
    many packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

This commit moves the calls to LedgerBackend in buildState and resumeState outside the critical section, which is a DB transaction that acquires the ingestion lock. To clean up the code, a LedgerBackend.GetLedgerBlocking method was introduced that always blocks until the ledger is available in the backend. Additionally, the GetLatestLedger method was renamed to GetLatestHistoryLedger for clarity, and the ProcessorRunner methods were changed so that they no longer depend on LedgerBackend.

Close #3517.

Why

If access to LedgerBackend hangs inside a critical section (ex. a broken connection to the Core DB or a remote captive core), ingestion halts on the entire cluster. With this change, only the single ingesting node will halt.
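
To illustrate the shape of the change, here is a rough sketch (not the code from the diff; names like historyQ, runner, and RunTransactionProcessorsOnLedger only approximate the Horizon internals):

// Sketch: the LedgerBackend call now happens before the DB transaction
// that acquires the ingestion lock.
func (s *system) ingestOneLedger(sequence uint32) error {
    // Outside the critical section: this may block on captive core or the
    // Core DB, but no other node is waiting on the ingestion lock for us.
    ledgerCloseMeta, err := s.ledgerBackend.GetLedgerBlocking(sequence)
    if err != nil {
        return err
    }

    // Critical section: DB transaction holding the ingestion lock.
    if err := s.historyQ.Begin(); err != nil {
        return err
    }
    defer s.historyQ.Rollback()

    if err := s.runner.RunTransactionProcessorsOnLedger(ledgerCloseMeta); err != nil {
        return err
    }
    return s.historyQ.Commit()
}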

Known limitations

All of the other states that read from LedgerBackend were not changed (they still access it inside the critical section). This is because those states read multiple ledgers (ex. historyRangeState), which may not fit into memory.

@@ -570,7 +630,16 @@ func (h historyRangeState) run(s *system) (transition, error) {
}
Contributor

do we still want to use s.maybePrepareRange() in this state? if not we can remove the maybePrepareRange() function entirely

Contributor Author

Yes, unfortunately we have to keep access to LedgerBackend inside the db transaction in this state. Alternatively, we could load all the ledgers in a range into memory, but for larger ranges that can cause OOM.


// getLedgerFromBackend gets a ledger from the backend. If the range is not
// prepared, it will prepare it.
func (s *system) getLedgerFromBackend(sequence uint32) (bool, *xdr.LedgerCloseMeta, error) {
Contributor

When I read getLedgerFromBackend() I thought it was strange that we had a wrapper around ledgerBackend.GetLedger(). I think if we changed the function signature to the one below it would reduce confusion.

func prepareAndGet(ledgerBackend ledgerbackend.LedgerBackend, sequence uint32) (bool, *xdr.LedgerCloseMeta, error) {
}
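
A possible body for such a helper could look like the sketch below. IsPrepared, PrepareRange and UnboundedRange are assumed from the ledgerbackend interface of that time; this only illustrates the suggestion and is not the merged code:

func prepareAndGet(ledgerBackend ledgerbackend.LedgerBackend, sequence uint32) (bool, *xdr.LedgerCloseMeta, error) {
    ledgerRange := ledgerbackend.UnboundedRange(sequence)
    prepared, err := ledgerBackend.IsPrepared(ledgerRange)
    if err != nil {
        return false, nil, err
    }
    if !prepared {
        // Prepare an unbounded range starting at the requested sequence.
        if err := ledgerBackend.PrepareRange(ledgerRange); err != nil {
            return false, nil, err
        }
    }
    return ledgerBackend.GetLedger(sequence)
}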

return false, &xdr.LedgerCloseMeta{}, nil
}

return s.ledgerBackend.GetLedger(sequence)
Contributor

it might be too much to include in this PR, but I think we should consider changing the LedgerBackend API so that we can add an argument to GetLedger() which configures whether the call should be blocking or non-blocking. Currently, the blocking behavior of GetLedger() is configured implicitly depending on whether you call PrepareRange with a bounded or unbounded range, which makes you have to internalize implementation details of CaptiveStellarCore to understand how GetLedger() will behave.
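
For example, a hypothetical shape for that API (none of these names exist in this PR):

// Hypothetical only; nothing below exists in this PR.
type GetLedgerMode int

const (
    // NonBlockingGetLedger returns exists == false right away when the
    // ledger is not yet available in the backend.
    NonBlockingGetLedger GetLedgerMode = iota
    // BlockingGetLedger waits until the requested ledger is available.
    BlockingGetLedger
)

type LedgerBackend interface {
    // GetLedger takes an explicit mode instead of deriving blocking behavior
    // from whether PrepareRange was called with a bounded or unbounded range.
    GetLedger(sequence uint32, mode GetLedgerMode) (bool, *xdr.LedgerCloseMeta, error)
    // ... the remaining LedgerBackend methods stay as they are.
}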

Contributor

having the blocking configuration passed into GetLedger() would also allow us to remove the fast-forward logic from buildState and resume, because we could change return s.ledgerBackend.GetLedger(sequence) so that it blocks until it finds ledger sequence.

Contributor Author

This is a valid point. I will try to do it in this PR.

// available in the backend (even for UnboundedRange).
// Please note that requesting a ledger sequence far after current ledger will
// block the execution for a long time.
func (c *CaptiveStellarCore) GetLedgerBlocking(sequence uint32) (xdr.LedgerCloseMeta, error) {
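
For context, one way a method like this can be built on top of the non-blocking GetLedger is a simple poll loop. This is an illustrative sketch; the poll interval and structure are assumptions, not the implementation in this diff:

func (c *CaptiveStellarCore) GetLedgerBlocking(sequence uint32) (xdr.LedgerCloseMeta, error) {
    for {
        exists, meta, err := c.GetLedger(sequence)
        if err != nil {
            return xdr.LedgerCloseMeta{}, err
        }
        if exists {
            return *meta, nil
        }
        // Ledger not available yet; wait before asking again (time is the
        // standard library package).
        time.Sleep(time.Second)
    }
}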
Contributor

given that we have this function, should we make GetLedger() always be non-blocking?

Contributor

@paulbellamy paulbellamy Apr 1, 2021

This also looks like it's not really thread-safe with GetLedger(), given it modifies the underlying state. Wouldn't "the Go way" be to make blocking the default, and use a goroutine/channel when you want it async? (When do we want it async, anyway?)

Edit: Ah, GetLedger() is more like MaybeGetLedger() in that the ledger might not exist in the backend yet.

Contributor Author

I agree that we probably should make GetLedger blocking by default, but before doing this I think we should: 1) ensure it doesn't cause unexpected issues (we should be able to tell after testing a Horizon release with this PR); 2) remember that we are modifying the stable ingest package, so we should do it via a major release. Going to create an issue about it soon.

@paulbellamy
Contributor

To some extent, it seems like the code around fsm.go and the processor mix up the concerns of "How do I get ledgers" and "What do I do with the data". Both are combined in history.IngestionQ, which leads the code to mix together their usage. Splitting the ledger-fetching methods out of history.IngestionQ might help.

I guess the structure I expected to see was something more like: LedgerReader -> IngestionProcessor -> Database, then with a top-level loop something like:

db := NewDB()
processor := NewIngestionProcessor(db)
for ledgers.Next() {
  ledger, err := ledgers.Get()
  err = processor.Ingest(ledger) // or Verify(ledger)
}

@bartekn
Contributor Author

bartekn commented Apr 2, 2021

@paulbellamy I'm not sure I understand.

seems like the code around fsm.go, and the processor mix up concerns around "How do I get ledgers", and "What do I do with the data".

I think "How do I get ledgers" concern was really mixed before that PR because ProcessorRunner called GetLedger of the backend to get ledger data. This PR actually solves this issue, due to different reasons but it seems to be solved. Right now we get a single ledger data in fsm.go and pass it to ProcessorRunner to process.

I guess the structure I expected to see was something more like: LedgerReader -> IngestionProcessor -> Database, then with a top-level loop something like:

The code you shared is actually how the previous ingestion system ran (pre 1.0.0), and there was nothing wrong with the main loop in particular. However, adding state ingestion (from history buckets) and distributed ingestion complicated this a lot: right now it's possible that while Horizon is ingesting ledgers one by one it may escape to another FSM state, ex. to rebuild ledger state because of an upgrade, OR it may need to skip one or more ledgers ingested by another machine.

Also, the FSM was designed when the non-blocking GetLedger was used (and the non-blocking version was used because back then there was no captive core). I think that if we are happy with the blocking version we may want to rewrite resumeState to be a loop similar to your snippet (a rough sketch follows below). But the rest of the states will need to stay like this.
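
Something like this, roughly (ingestLedgerInCriticalSection is a placeholder for the existing lock/transaction handling, not a real function):

func (s *system) resumeLoop(startLedger uint32) error {
    for sequence := startLedger; ; sequence++ {
        // Blocks outside the ingestion lock until the ledger is available.
        ledgerCloseMeta, err := s.ledgerBackend.GetLedgerBlocking(sequence)
        if err != nil {
            return err
        }
        // Ingest inside the usual critical section (DB transaction + lock).
        if err := s.ingestLedgerInCriticalSection(ledgerCloseMeta); err != nil {
            return err
        }
    }
}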

Please elaborate, maybe I'm missing something.

@paulbellamy
Contributor

paulbellamy commented Apr 2, 2021 via email

@bartekn
Contributor Author

bartekn commented Apr 2, 2021

Awesome!

@bartekn bartekn marked this pull request as ready for review April 2, 2021 16:15
@bartekn bartekn requested a review from a team April 2, 2021 16:15
@bartekn
Contributor Author

bartekn commented Apr 8, 2021

@tamirms FYI I changed GetLedger to GetLedgerBlocking in historyRangeState: 1a06520.

@@ -644,7 +655,16 @@ func (h reingestHistoryRangeState) ingestRange(s *system, fromLedger, toLedger u
}

for cur := fromLedger; cur <= toLedger; cur++ {
if err = runTransactionProcessorsOnLedger(s, cur); err != nil {
exists, ledgerCloseMeta, err := s.ledgerBackend.GetLedger(cur)
Contributor

should this also be blocking? I think it should be blocking for captive core. But when using a db ledger backend, I guess we expect that the ledger should already be in the db. If that reasoning is correct, we could just add a comment to make this blocking once we remove the db ledger backend.

Contributor Author

It doesn't need to use GetLedgerBlocking because we prepare a bounded range in reingestHistoryRangeState, which makes CC block. But I agree, we can make it blocking explicitly after updating the ledger backend's GetLedger to always be blocking (and when we remove the DB backend). Not adding a comment because once the above is done, it will not be possible to get ledgers in a non-blocking way.
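
For illustration, the bounded-range flow is roughly as follows (names approximate the surrounding code, BoundedRange is assumed from the ledgerbackend package, and errors is the standard library package; this is not the merged diff):

// The bounded range prepared below makes captive core's GetLedger wait for
// each ledger in [fromLedger, toLedger], so GetLedgerBlocking is not needed.
if err := s.ledgerBackend.PrepareRange(ledgerbackend.BoundedRange(fromLedger, toLedger)); err != nil {
    return err
}
for cur := fromLedger; cur <= toLedger; cur++ {
    exists, ledgerCloseMeta, err := s.ledgerBackend.GetLedger(cur)
    if err != nil {
        return err
    }
    if !exists {
        // With a prepared bounded range this should not happen.
        return errors.New("ledger not found in prepared range")
    }
    if err := runTransactionProcessorsOnLedger(s, ledgerCloseMeta); err != nil {
        return err
    }
}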

Contributor

@tamirms tamirms left a comment

I think in a future PR we should make captive core always block on GetLedger(), because I can't think of any cases in fsm.go where we want it to be non-blocking.

Development

Successfully merging this pull request may close these issues.

Move LedgerBackend operations outside ingest locked section