services/horizon: Removed LedgerBackend calls inside ingest critical section #3518
Conversation
@@ -570,7 +630,16 @@ func (h historyRangeState) run(s *system) (transition, error) {
}
do we still want to use s.maybePrepareRange() in this state? if not, we can remove the maybePrepareRange() function entirely
Yes, unfortunately we have to keep access to LedgerBackend inside the DB transaction in this state. Alternatively, we could load all ledgers in a range into memory, but for larger ranges that can cause OOM.
// getLedgerFromBackend gets ledger from the backend. If range is not prepared
// it will prepare it.
func (s *system) getLedgerFromBackend(sequence uint32) (bool, *xdr.LedgerCloseMeta, error) {
When I read getLedgerFromBackend() I thought it was strange that we had a wrapper around ledgerBackend.GetLedger(). I think if we changed the function signature to the one below it would reduce confusion.
func prepareAndGet(ledgerBackend ledgerbackend.LedgerBackend, sequence uint32) (bool, *xdr.LedgerCloseMeta, error) {
}
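A hedged sketch of how such a helper's body could look, assuming the signatures shown in this thread (the IsPrepared/PrepareRange handling is illustrative, based only on the getLedgerFromBackend doc comment above, not on the PR's diff):

func prepareAndGet(ledgerBackend ledgerbackend.LedgerBackend, sequence uint32) (bool, *xdr.LedgerCloseMeta, error) {
	// Prepare an unbounded range starting at the requested sequence if the
	// backend is not prepared yet, then delegate to GetLedger.
	ledgerRange := ledgerbackend.UnboundedRange(sequence)
	prepared, err := ledgerBackend.IsPrepared(ledgerRange)
	if err != nil {
		return false, &xdr.LedgerCloseMeta{}, err
	}
	if !prepared {
		if err := ledgerBackend.PrepareRange(ledgerRange); err != nil {
			return false, &xdr.LedgerCloseMeta{}, err
		}
	}
	return ledgerBackend.GetLedger(sequence)
}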
return false, &xdr.LedgerCloseMeta{}, nil
}

return s.ledgerBackend.GetLedger(sequence)
it might be too much to include in this PR, but I think we should consider changing the LedgerBackend API so that we can add an argument to GetLedger() which configures whether the call should be blocking or non-blocking. Currently, the blocking behavior of GetLedger() is configured implicitly depending on whether you call PrepareRange with a bounded or unbounded range, which makes you have to internalize implementation details of CaptiveStellarCore to understand how GetLedger() will behave.
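One possible shape of that API (purely illustrative; the GetLedgerMode type and constant names are made up for this sketch):

type GetLedgerMode int

const (
	NonBlocking GetLedgerMode = iota
	Blocking
)

type LedgerBackend interface {
	// ... existing methods unchanged ...

	// The blocking behavior becomes an explicit argument instead of being
	// implied by whether PrepareRange was called with a bounded or
	// unbounded range.
	GetLedger(sequence uint32, mode GetLedgerMode) (bool, *xdr.LedgerCloseMeta, error)
}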
having the blocking configuration passed into GetLedger() would also allow us to remove the fast-forward logic from buildState and resume, because we could change return s.ledgerBackend.GetLedger(sequence) so that it blocks until it finds ledger sequence
This is a valid point. I will try to do it in this PR.
// available in the backend (even for UnboundedRange).
// Please note that requesting a ledger sequence far after current ledger will
// block the execution for a long time.
func (c *CaptiveStellarCore) GetLedgerBlocking(sequence uint32) (xdr.LedgerCloseMeta, error) {
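A hedged guess at what this method does under the hood, based only on the thread-safety remark below (the blocking field and the delegation to GetLedger are assumptions, not verified against the diff):

import "errors"

func (c *CaptiveStellarCore) GetLedgerBlocking(sequence uint32) (xdr.LedgerCloseMeta, error) {
	// Flip the backend into blocking mode, delegate to the non-blocking
	// GetLedger, then restore the flag. Mutating this shared field is what
	// would make concurrent use with GetLedger racy.
	c.blocking = true
	defer func() { c.blocking = false }()

	exists, meta, err := c.GetLedger(sequence)
	if err != nil {
		return xdr.LedgerCloseMeta{}, err
	}
	if !exists || meta == nil {
		return xdr.LedgerCloseMeta{}, errors.New("ledger not available despite blocking mode")
	}
	return *meta, nil
}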
given that we have this function, should we make GetLedger() always be non-blocking?
This also looks like it's not really threadsafe with GetLedger(), given it modifies the underlying thing. Wouldn't "the Go way" be to have the default be blocking, and use a goroutine/channel when you want it async? (When do we want it async, anyway?)
Edit: Ah, GetLedger() is more like MaybeGetLedger() in that the ledger might not exist in the backend yet.
I agree that we probably should make GetLedger blocking by default, but before doing this I think we should 1) ensure it doesn't cause unexpected issues (we should be able to tell after testing the Horizon release with this PR), 2) since we are modifying the stable ingest package, do it via a major release. Going to create an issue about it soon.
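To illustrate the goroutine/channel suggestion above, a minimal hypothetical wrapper over the blocking call (only GetLedgerBlocking's signature comes from this PR; the helper and its names are made up):

// ledgerResult carries the outcome of an asynchronous fetch.
type ledgerResult struct {
	meta xdr.LedgerCloseMeta
	err  error
}

// getLedgerAsync runs the blocking call in its own goroutine and hands the
// result back on a channel, so a caller that genuinely needs non-blocking
// behavior can select on the channel (e.g. alongside a shutdown signal).
func getLedgerAsync(c *CaptiveStellarCore, sequence uint32) <-chan ledgerResult {
	ch := make(chan ledgerResult, 1)
	go func() {
		meta, err := c.GetLedgerBlocking(sequence)
		ch <- ledgerResult{meta: meta, err: err}
	}()
	return ch
}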
To some extent, it seems like the code around fsm.go and the processor mix up concerns around "How do I get ledgers" and "What do I do with the data". I guess the structure I expected to see was something more like: LedgerReader -> IngestionProcessor -> Database, then with a top-level loop something like:

db := NewDB()
processor := NewIngestionProcessor(db)
for ledgers.Next() {
	ledger, err := ledgers.Get()
	err = processor.Ingest(ledger) // or Verify(ledger)
}
@paulbellamy I'm not sure I understand.

I think the "How do I get ledgers" concern was really mixed before this PR because ProcessorRunner called GetLedger on the backend to get ledger data. This PR actually solves this issue, due to different reasons, but it seems to be solved. Right now we get a single ledger's data in fsm.go and pass it to ProcessorRunner to process.

The code you shared was actually how the previous ingesting system was running (pre 1.0.0) and there was nothing wrong with the main loop in particular. However, because we added state ingestion (from history buckets) and distributed ingestion it complicated this a lot: right now it's possible that when Horizon is ingesting ledgers one by one it may escape to another FSM state, ex. to rebuild ledger state because of the upgrade etc., OR it may need to skip one or more ledgers ingested by another machine.

Also, the FSM was designed when the non-blocking GetLedger was used. I *think* that if we are happy with the blocking version we may want to rewrite resumeState to be a loop similar to your snippet. But the rest of the states will need to stay like this.

Please elaborate, maybe I'm missing something.
No, that's pretty close to my understanding now as well. Wasn't a comment on this PR in particular, just generally.
Awesome!
@@ -644,7 +655,16 @@ func (h reingestHistoryRangeState) ingestRange(s *system, fromLedger, toLedger u
}

for cur := fromLedger; cur <= toLedger; cur++ {
	if err = runTransactionProcessorsOnLedger(s, cur); err != nil {
	exists, ledgerCloseMeta, err := s.ledgerBackend.GetLedger(cur)
should this also be blocking? I think it should be blocking for captive core. But when using a db ledger backend, I guess we expect that the ledger should already be in the db. If that reasoning is correct, we could just add a comment to make this blocking once we remove the db ledger backend.
It doesn't need to use GetLedgerBlocking because we prepare a bounded range in reingestHistoryRangeState, which will make CC blocking. But I agree, we can make it blocking explicitly after updating the ledger backend GetLedger to be always blocking (and when we remove the DB backend). Not adding a comment because once the above is done, it will not be possible to get ledgers in a non-blocking way.
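For reference, the bounded-range preparation described above looks roughly like this (a sketch; the error handling and the runTransactionProcessorsOnLedger signature are illustrative, while PrepareRange, BoundedRange, and GetLedger come from the ledgerbackend package):

if err := s.ledgerBackend.PrepareRange(ledgerbackend.BoundedRange(fromLedger, toLedger)); err != nil {
	return errors.Wrap(err, "error preparing range")
}

for cur := fromLedger; cur <= toLedger; cur++ {
	// With a bounded range prepared, captive core's GetLedger waits for
	// `cur` to become available instead of returning exists == false.
	exists, ledgerCloseMeta, err := s.ledgerBackend.GetLedger(cur)
	if err != nil {
		return errors.Wrap(err, "error getting ledger")
	}
	if !exists {
		return errors.Errorf("ledger %d not found in the backend", cur)
	}
	if err = runTransactionProcessorsOnLedger(s, ledgerCloseMeta); err != nil {
		return err
	}
}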
I think in a future PR we should make captive core always block on GetLedger() because I can't think of any cases in fsm.go where we want it to be non-blocking.
PR Checklist

PR Structure
- This PR avoids mixing refactoring changes with feature changes (split into two PRs otherwise).
- This PR's title starts with the name of the package that is most changed in the PR, ex. services/friendbot, or all or doc if the changes are broad or impact many packages.

Thoroughness
- I've updated any docs (developer docs, .md files, etc... affected by this change). Take a look in the docs folder for a given service, like this one.

Release planning
- I've updated the relevant CHANGELOG if needed with deprecations, added features, breaking changes, and DB schema changes.
- I've decided if this PR requires a new major/minor version according to semver, or if it's mainly a patch change. The PR is targeted at the next release branch if it's not a patch change.
What

This commit moves calls to LedgerBackend in buildState and resumeState outside the critical section, which is a DB transaction acquiring the ingestion lock. To clean up the code, a LedgerBackend.GetLedgerBlocking method was introduced that always blocks until the ledger is available in the backend. Additionally, the GetLatestLedger method was renamed to GetLatestHistoryLedger for clarity, and ProcessorRunner methods have changed and no longer depend on LedgerBackend.

Close #3517.
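A hedged sketch of the resulting call order in resumeState (only GetLedgerBlocking comes from this PR; the surrounding identifiers are illustrative):

// 1. Read the ledger from the backend first, outside any DB transaction.
ledgerCloseMeta, err := s.ledgerBackend.GetLedgerBlocking(sequence)
if err != nil {
	return errors.Wrap(err, "error getting ledger from the backend")
}

// 2. Only then open the ingestion transaction (the critical section that
//    holds the cluster-wide ingestion lock) and run the processors.
if err := s.historyQ.Begin(); err != nil {
	return errors.Wrap(err, "error starting a transaction")
}
defer s.historyQ.Rollback()

if err := s.runner.RunAllProcessorsOnLedger(ledgerCloseMeta); err != nil {
	return errors.Wrap(err, "error processing ledger")
}
return s.historyQ.Commit()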
Why

If access to LedgerBackend hangs (ex. a broken connection to the Core DB or to remote captive core) inside a critical section, ingestion will halt in the entire cluster. With this change, only a single ingesting node will halt.

Known limitations
All of the other states that read from LedgerBackend were not changed (they still access it inside the critical section). This is because multiple ledgers are read (ex. in historyRangeState) and they may not fit into memory.