services/horizon/internal/ingest: Fix deadlock in parallel ingestion #5263
Conversation
Is it worth taking a step back from the current lookup tables based on a sequential id generator and exploring any other table schema that might do better in the current flow, since the application flows are now getting more complex due to non-app related db concerns? One option: lookup tables use a hash of the string lookup key to generate deterministic ids.
Is there an issue/ticket that captures this for visibility on the project side? It seems like this would be part of Sustainability/Object-8, and this time/effort could be accounted for.
The challenge with that approach is dealing with collisions, and we'd also have to reingest. It's worth thinking about, but it falls outside the scope of this PR.
This PR fixes a bug, so it doesn't get pointed.
@tamirms, one more change is to skip the filtered tmp processor during reingest (pr). I could merge that into here or do it separately.
lgtm
@sreuland I think it would be easier to review that change separately. I've configured auto-merge on this PR so it should be merged soon.
was there a GH issue on the bug? |
no |
@devfed1 this fix was released in version 2.30.0 |
PR Checklist
PR Structure
- This PR avoids mixing refactoring changes with feature changes (split into two PRs otherwise).
- This PR's title starts with the name of the package that is most changed in the PR, ex. services/friendbot, or all or doc if the changes are broad or impact many packages.
Thoroughness
- I've updated any docs (developer docs, .md files, etc... affected by this change). Take a look in the docs folder for a given service, like this one.
Release planning
- I've updated the relevant CHANGELOG if needed with deprecations, added features, breaking changes, and DB schema changes.
- I've decided if this PR requires a new major/minor version according to semver, or if it's mainly a patch change. The PR is targeted at the next release branch if it's not a patch change.
What
I tried running the reingest command with 2 parallel workers and ran into a postgres deadlock error which caused the reingestion to fail.
This deadlock occurs because we have two go routines which both try to insert the same addresses into the `history_accounts` table, but in a different order. For example:

- t: go routine 1 tries to insert `accountA`.
- t+1: go routine 2 tries to insert `accountB`.
- t+3: go routine 2 tries to insert `accountA` and is blocked by go routine 1, which holds the ShareLock on `rowA`. go routine 2 cannot progress until go routine 1 commits or rolls back its transaction.
- t+4: go routine 1 tries to insert `accountB` and is blocked by go routine 2, which holds the ShareLock on `rowB`. Now there is a deadlock because there is a cycle of dependencies.
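This lock cycle can be reproduced outside of Horizon. Below is a minimal, timing-dependent sketch; the table definition, connection string, and `ON CONFLICT` clause are assumptions for illustration, not Horizon's actual schema or queries.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

// Minimal repro of the lock cycle described above. Assumes a scratch postgres
// database with a table like:
//
//	CREATE TABLE history_accounts (address text PRIMARY KEY);
//
// Each transaction inserts the same two addresses in opposite orders. The
// unique index makes the second inserter of an address wait on the first
// transaction's lock, so with the right interleaving postgres reports
// "deadlock detected" and aborts one of the transactions.
func main() {
	db, err := sql.Open("postgres", "postgres://localhost/scratch?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	insertAll := func(addresses []string, done chan<- error) {
		tx, err := db.BeginTx(ctx, nil)
		if err != nil {
			done <- err
			return
		}
		defer tx.Rollback()
		for _, a := range addresses {
			if _, err := tx.ExecContext(ctx,
				`INSERT INTO history_accounts (address) VALUES ($1)
				 ON CONFLICT (address) DO NOTHING`, a); err != nil {
				done <- err
				return
			}
		}
		done <- tx.Commit()
	}

	done := make(chan error, 2)
	go insertAll([]string{"accountA", "accountB"}, done) // go routine 1
	go insertAll([]string{"accountB", "accountA"}, done) // go routine 2
	fmt.Println(<-done, <-done) // one of the two should fail with a deadlock error
}
```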
This deadlock scenario should not be possible because we sort the account addresses before inserting into `history_accounts`. But it turns out that sorting alone is not sufficient to prevent this type of deadlock, because of the data flow for parallel reingestion.

Let's say we want to reingest ledgers 1 - 1000 with 2 parallel workers. Horizon will spin up 2 go routines. The first go routine will ingest ledgers 1 - 500 and the second go routine will ingest ledgers 501 - 1000. Each go routine will perform the ingestion of its subrange in a single transaction.
However, each go routine will further subdivide its subrange into batches of ledgers. Each batch will be ingested separately, which means that we will insert rows into `history_accounts` when ingesting each batch. So even if the account addresses are sorted within each individual batch, that does not mean the account addresses across all batches executed within a single transaction are sorted. Therefore, the possibility of a deadlock remains.

To fix this issue I have encapsulated the ingestion of each ledger batch within a single transaction. Each go routine will no longer perform the ingestion of its entire subrange in one transaction; instead, the go routines will execute multiple transactions (one per ledger batch) during their ingestion of the subrange.
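A minimal sketch of that per-batch transaction structure follows; the `ledgerRange` type and the `ingestBatch` callback are hypothetical stand-ins, not Horizon's actual API.

```go
package ingest

import (
	"context"
	"database/sql"
)

// ledgerRange is a hypothetical [From, To] batch of ledger sequence numbers.
type ledgerRange struct{ From, To uint32 }

// reingestSubrange sketches the fix: instead of holding one transaction open
// for a worker's entire subrange, each ledger batch gets its own transaction,
// so locks on history_accounts rows are released at every batch boundary.
func reingestSubrange(
	ctx context.Context,
	db *sql.DB,
	batches []ledgerRange,
	ingestBatch func(context.Context, *sql.Tx, ledgerRange) error,
) error {
	for _, batch := range batches {
		tx, err := db.BeginTx(ctx, nil)
		if err != nil {
			return err
		}
		if err := ingestBatch(ctx, tx, batch); err != nil {
			tx.Rollback()
			return err
		}
		if err := tx.Commit(); err != nil {
			return err
		}
	}
	return nil
}
```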
With the change described above, a transaction in parallel ingestion may block on another if they're inserting common addresses into `history_accounts`, but there should never be a deadlock. However, I noticed that performance could be improved by having more fine grained transactions. The workflow of each transaction resembles the following: first, we insert into the lookup tables, and then we insert into the history tables using the ids from the lookup tables. The first part of the transaction can block if there are other transactions which insert common addresses into `history_accounts`; however, the second part of the transaction should never block on other concurrent transactions.

So, by splitting the workflow into two transactions, we can reduce contention and allow more concurrency when inserting rows into the history tables.
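A rough sketch of that split, with hypothetical helpers (`insertLookupEntries`, `insertHistoryRows`) standing in for the actual processors:

```go
package ingest

import (
	"context"
	"database/sql"
)

// ingestBatchSplit sketches the two-transaction workflow. The first
// transaction only touches the lookup tables (e.g. history_accounts); it is
// the only part that can block on concurrent inserts of common addresses.
// The second transaction inserts into the history tables using the ids
// resolved in the first step, and never contends with other workers.
func ingestBatchSplit(
	ctx context.Context,
	db *sql.DB,
	addresses []string,
	insertLookupEntries func(context.Context, *sql.Tx, []string) (map[string]int64, error),
	insertHistoryRows func(context.Context, *sql.Tx, map[string]int64) error,
) error {
	// Transaction 1: resolve/insert lookup table ids (may block briefly
	// on concurrent transactions inserting the same addresses).
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	ids, err := insertLookupEntries(ctx, tx, addresses)
	if err != nil {
		tx.Rollback()
		return err
	}
	if err := tx.Commit(); err != nil {
		return err
	}

	// Transaction 2: insert history rows referencing those ids; this part
	// does not contend with other workers.
	tx, err = db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	if err := insertHistoryRows(ctx, tx, ids); err != nil {
		tx.Rollback()
		return err
	}
	return tx.Commit()
}
```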
After implementing that change, I observed a 10-15% speed up in parallel ingestion when testing locally on my laptop using 2 workers.
I think there could be other data flow changes which could also improve performance by reducing contention in the postgres transactions. One data flow in particular seems promising; we can experiment with that idea in another PR.
Known limitations
If individual transactions fail, the DB could be left in a partially ingested state. This can still be avoided by ingesting without concurrency (albeit without the performance speedup that concurrency gives you). But this limitation existed even prior to this change.