
services/horizon: Removed LedgerBackend calls inside ingest critical section #3518

Merged
merged 18 commits into from
Apr 12, 2021

Conversation

bartekn
Contributor

@bartekn bartekn commented Mar 30, 2021

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the name of the package that is most changed in the
    PR, ex. services/friendbot, or all or doc if the changes are broad or impact
    many packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

This commit moves the calls to LedgerBackend in buildState and resumeState outside the critical section, which is a DB transaction that acquires the ingestion lock. To clean up the code, a LedgerBackend.GetLedgerBlocking method was introduced that always blocks until the ledger is available in the backend. Additionally, the GetLatestLedger method was renamed to GetLatestHistoryLedger for clarity, and the ProcessorRunner methods were changed so that they no longer depend on LedgerBackend.

Close #3517.

Why

If access to LedgerBackend hangs inside a critical section (ex. a broken connection to the Core DB or a remote captive core), ingestion halts on the entire cluster. With this change, only the single ingesting node will halt.
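
To illustrate the shape of the change, here is a rough sketch (not the code from the diff; names like historyQ, runner, and RunTransactionProcessorsOnLedger only approximate the Horizon internals):

// Sketch: the LedgerBackend call now happens before the DB transaction
// that acquires the ingestion lock.
func (s *system) ingestOneLedger(sequence uint32) error {
    // Outside the critical section: this may block on captive core or the
    // Core DB, but no other node is waiting on the ingestion lock for us.
    ledgerCloseMeta, err := s.ledgerBackend.GetLedgerBlocking(sequence)
    if err != nil {
        return err
    }

    // Critical section: DB transaction holding the ingestion lock.
    if err := s.historyQ.Begin(); err != nil {
        return err
    }
    defer s.historyQ.Rollback()

    if err := s.runner.RunTransactionProcessorsOnLedger(ledgerCloseMeta); err != nil {
        return err
    }
    return s.historyQ.Commit()
}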

Known limitations

All of the other states that read from LedgerBackend were not changed (they still access it inside the critical section). This is because those states read multiple ledgers (ex. historyRangeState), which may not fit into memory.

@@ -570,7 +630,16 @@ func (h historyRangeState) run(s *system) (transition, error) {
}
Contributor

do we still want to use s.maybePrepareRange() in this state? if not we can remove the maybePrepareRange() function entirely

Contributor Author

Yes, unfortunately we have to keep access to LedgerBackend inside the db transaction in this state. Alternatively, we could load all the ledgers in a range into memory, but for larger ranges that can cause OOM.


// getLedgerFromBackend gets a ledger from the backend. If the range is not
// prepared, it will prepare it.
func (s *system) getLedgerFromBackend(sequence uint32) (bool, *xdr.LedgerCloseMeta, error) {
Contributor

When I read getLedgerFromBackend() I thought it was strange that we had a wrapper around ledgerBackend.GetLedger(). I think if we changed the function signature to the one below it would reduce confusion.

func prepareAndGet(ledgerBackend ledgerbackend.LedgerBackend, sequence uint32) (bool, *xdr.LedgerCloseMeta, error) {
}
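
A possible body for such a helper could look like the sketch below. IsPrepared, PrepareRange and UnboundedRange are assumed from the ledgerbackend interface of that time; this only illustrates the suggestion and is not the merged code:

func prepareAndGet(ledgerBackend ledgerbackend.LedgerBackend, sequence uint32) (bool, *xdr.LedgerCloseMeta, error) {
    ledgerRange := ledgerbackend.UnboundedRange(sequence)
    prepared, err := ledgerBackend.IsPrepared(ledgerRange)
    if err != nil {
        return false, nil, err
    }
    if !prepared {
        // Prepare an unbounded range starting at the requested sequence.
        if err := ledgerBackend.PrepareRange(ledgerRange); err != nil {
            return false, nil, err
        }
    }
    return ledgerBackend.GetLedger(sequence)
}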

return false, &xdr.LedgerCloseMeta{}, nil
}

return s.ledgerBackend.GetLedger(sequence)
Contributor

it might be too much to include in this PR, but I think we should consider changing the LedgerBackend API so that we can add an argument to GetLedger() which configures whether the call should be blocking or non-blocking. Currently, the blocking behavior of GetLedger() is configured implicitly depending on whether you call PrepareRange with a bounded or unbounded range, which makes you have to internalize implementation details of CaptiveStellarCore to understand how GetLedger() will behave.
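
For example, a hypothetical shape for that API (none of these names exist in this PR):

// Hypothetical only; nothing below exists in this PR.
type GetLedgerMode int

const (
    // NonBlockingGetLedger returns exists == false right away when the
    // ledger is not yet available in the backend.
    NonBlockingGetLedger GetLedgerMode = iota
    // BlockingGetLedger waits until the requested ledger is available.
    BlockingGetLedger
)

type LedgerBackend interface {
    // GetLedger takes an explicit mode instead of deriving blocking behavior
    // from whether PrepareRange was called with a bounded or unbounded range.
    GetLedger(sequence uint32, mode GetLedgerMode) (bool, *xdr.LedgerCloseMeta, error)
    // ... the remaining LedgerBackend methods stay as they are.
}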

Contributor

having the blocking configuration passed into GetLedger() would also allow us to remove the fast-forward logic from buildState and resume, because we could change return s.ledgerBackend.GetLedger(sequence) so that it blocks until it finds ledger sequence.

Contributor Author

This is a valid point. I will try to do it in this PR.

// available in the backend (even for UnboundedRange).
// Please note that requesting a ledger sequence far after current ledger will
// block the execution for a long time.
func (c *CaptiveStellarCore) GetLedgerBlocking(sequence uint32) (xdr.LedgerCloseMeta, error) {
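
For context, one way a method like this can be built on top of the non-blocking GetLedger is a simple poll loop. This is an illustrative sketch; the poll interval and structure are assumptions, not the implementation in this diff:

func (c *CaptiveStellarCore) GetLedgerBlocking(sequence uint32) (xdr.LedgerCloseMeta, error) {
    for {
        exists, meta, err := c.GetLedger(sequence)
        if err != nil {
            return xdr.LedgerCloseMeta{}, err
        }
        if exists {
            return *meta, nil
        }
        // Ledger not available yet; wait before asking again (time is the
        // standard library package).
        time.Sleep(time.Second)
    }
}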
Contributor

given that we have this function, should we make GetLedger() always be non-blocking?

Contributor

@paulbellamy paulbellamy Apr 1, 2021

This also looks like it's not really thread-safe with GetLedger(), given it modifies the underlying state. Wouldn't "the Go way" be to make blocking the default, and use a goroutine/channel when you want it async? (When do we want it async, anyway?)

Edit: Ah, GetLedger() is more like MaybeGetLedger() in that the ledger might not exist in the backend yet.

Contributor Author

I agree that we probably should make GetLedger blocking by default, but before doing this I think we should: 1) ensure it doesn't cause unexpected issues (we should be able to tell after testing a Horizon release with this PR); 2) remember that we are modifying the stable ingest package, so we should do it via a major release. Going to create an issue about it soon.

@paulbellamy
Contributor

To some extent, it seems like the code around fsm.go and the processor mix up the concerns of "How do I get ledgers" and "What do I do with the data". Both are combined in history.IngestionQ, which leads the code to mix together their usage. Splitting the ledger-fetching methods out of history.IngestionQ might help.

I guess the structure I expected to see was something more like: LedgerReader -> IngestionProcessor -> Database, then with a top-level loop something like:

db := NewDB()
processor := NewIngestionProcessor(db)
for ledgers.Next() {
  ledger, err := ledgers.Get()
  err = processor.Ingest(ledger) // or Verify(ledger)
}

@bartekn
Contributor Author

bartekn commented Apr 2, 2021

@paulbellamy I'm not sure I understand.

seems like the code around fsm.go, and the processor mix up concerns around "How do I get ledgers", and "What do I do with the data".

I think "How do I get ledgers" concern was really mixed before that PR because ProcessorRunner called GetLedger of the backend to get ledger data. This PR actually solves this issue, due to different reasons but it seems to be solved. Right now we get a single ledger data in fsm.go and pass it to ProcessorRunner to process.

I guess the structure I expected to see was something more like: LedgerReader -> IngestionProcessor -> Database, then with a top-level loop something like:

The code you shared is actually how the previous ingestion system ran (pre 1.0.0), and there was nothing wrong with the main loop in particular. However, adding state ingestion (from history buckets) and distributed ingestion complicated this a lot: right now it's possible that while Horizon is ingesting ledgers one by one it may escape to another FSM state, ex. to rebuild ledger state because of an upgrade, OR it may need to skip one or more ledgers ingested by another machine.

Also, the FSM was designed when the non-blocking GetLedger was used (and the non-blocking version was used because back then there was no captive core). I think that if we are happy with the blocking version we may want to rewrite resumeState to be a loop similar to your snippet (a rough sketch follows below). But the rest of the states will need to stay like this.
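
Something like this, roughly (ingestLedgerInCriticalSection is a placeholder for the existing lock/transaction handling, not a real function):

func (s *system) resumeLoop(startLedger uint32) error {
    for sequence := startLedger; ; sequence++ {
        // Blocks outside the ingestion lock until the ledger is available.
        ledgerCloseMeta, err := s.ledgerBackend.GetLedgerBlocking(sequence)
        if err != nil {
            return err
        }
        // Ingest inside the usual critical section (DB transaction + lock).
        if err := s.ingestLedgerInCriticalSection(ledgerCloseMeta); err != nil {
            return err
        }
    }
}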

Please elaborate, maybe I'm missing something.

@paulbellamy
Contributor

paulbellamy commented Apr 2, 2021 via email

@bartekn
Contributor Author

bartekn commented Apr 2, 2021

Awesome!

@bartekn bartekn marked this pull request as ready for review April 2, 2021 16:15
@bartekn bartekn requested a review from a team April 2, 2021 16:15
@bartekn
Contributor Author

bartekn commented Apr 8, 2021

@tamirms FYI I changed GetLedger to GetLedgerBlocking in historyRangeState: 1a06520.

@@ -644,7 +655,16 @@ func (h reingestHistoryRangeState) ingestRange(s *system, fromLedger, toLedger u
}

for cur := fromLedger; cur <= toLedger; cur++ {
if err = runTransactionProcessorsOnLedger(s, cur); err != nil {
exists, ledgerCloseMeta, err := s.ledgerBackend.GetLedger(cur)
Contributor

should this also be blocking? I think it should be blocking for captive core. But when using a db ledger backend, I guess we expect that the ledger should already be in the db. If that reasoning is correct, we could just add a comment to make this blocking once we remove the db ledger backend.

Contributor Author

It doesn't need to use GetLedgerBlocking because we prepare a bounded range in reingestHistoryRangeState, which makes CC block. But I agree, we can make it blocking explicitly after updating the ledger backend's GetLedger to always be blocking (and when we remove the DB backend). Not adding a comment because once the above is done, it will not be possible to get ledgers in a non-blocking way.
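
For illustration, the bounded-range flow is roughly as follows (names approximate the surrounding code, BoundedRange is assumed from the ledgerbackend package, and errors is the standard library package; this is not the merged diff):

// The bounded range prepared below makes captive core's GetLedger wait for
// each ledger in [fromLedger, toLedger], so GetLedgerBlocking is not needed.
if err := s.ledgerBackend.PrepareRange(ledgerbackend.BoundedRange(fromLedger, toLedger)); err != nil {
    return err
}
for cur := fromLedger; cur <= toLedger; cur++ {
    exists, ledgerCloseMeta, err := s.ledgerBackend.GetLedger(cur)
    if err != nil {
        return err
    }
    if !exists {
        // With a prepared bounded range this should not happen.
        return errors.New("ledger not found in prepared range")
    }
    if err := runTransactionProcessorsOnLedger(s, ledgerCloseMeta); err != nil {
        return err
    }
}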

Contributor

@tamirms tamirms left a comment

I think in a future PR we should make captive core always block on GetLedger(), because I can't think of any cases in fsm.go where we want it to be non-blocking.

Development

Successfully merging this pull request may close these issues.

Move LedgerBackend operations outside ingest locked section