Ingestion performance fixes #316
Conversation
@@ -192,7 +267,12 @@ func (ingest *Ingestion) Start() (err error) {
    return
}

ingest.accountIDMapping = make(map[xdr.AccountId]int64)
As per your concerns about this exploding, how about using an LRU cache instead?
perhaps something like https://github.com/hashicorp/golang-lru
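If the LRU route were taken, a minimal sketch might look like the following; the wrapper type, its method names, and the lookup callback are hypothetical, not horizon's actual API:

```go
package ingest

import (
	lru "github.com/hashicorp/golang-lru"
)

// accountIDCache is a hypothetical bounded cache in front of the
// GetCreateAccountID database lookup. Least recently used entries are
// evicted, so memory stays bounded even across busy ledgers.
type accountIDCache struct {
	cache *lru.Cache
}

func newAccountIDCache(size int) (*accountIDCache, error) {
	c, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &accountIDCache{cache: c}, nil
}

// getCreateAccountID returns the cached ID when present, otherwise calls the
// provided lookup (e.g. the existing DB query) and caches the result.
func (a *accountIDCache) getCreateAccountID(address string, lookup func(string) (int64, error)) (int64, error) {
	if v, ok := a.cache.Get(address); ok {
		return v.(int64), nil
	}
	id, err := lookup(address)
	if err != nil {
		return 0, err
	}
	a.cache.Add(address, id)
	return id, nil
}
```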
I think caching GetCreateAccountID results is just a temporary solution to allow us to ship this fix ASAP. The long-term solution can be one of the following:
- More complicated code that gets all account IDs after the ledger is ingested, inserts new accounts, and updates INSERT queries before executing them.
- DB schema redesign so getting IDs is not required (accounts are identified by their AccountID).
- Something else?
@@ -192,7 +271,12 @@ func (ingest *Ingestion) Start() (err error) {
    return
}

ingest.accountIDMapping = make(map[xdr.AccountId]int64)
This cache is per ingestion session. We should probably keep it between sessions.
elapsed := time.Now().Sub(start2)
fmt.Println("ingestTransaction", elapsed)
}()
This can be removed after tests.
Before I type out my full comments, I just want to put this here: https://godoc.org/github.com/lib/pq#hdr-Bulk_imports

We should probably look into using the above method for performing this bulk work as opposed to manually building huge SQL statements with 60,000 bound parameters.
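For reference, the linked docs describe COPY-based bulk imports roughly as follows; the table and column names below are placeholders, not horizon's actual schema:

```go
package ingest

import (
	"database/sql"

	"github.com/lib/pq"
)

// copyParticipants streams rows with COPY instead of binding them as
// parameters of one giant INSERT, so the 65535-parameter limit never applies.
func copyParticipants(db *sql.DB, rows [][]interface{}) error {
	txn, err := db.Begin()
	if err != nil {
		return err
	}
	defer txn.Rollback() // no-op if the transaction was committed

	stmt, err := txn.Prepare(pq.CopyIn(
		"history_transaction_participants", // placeholder table name
		"history_transaction_id", "history_account_id",
	))
	if err != nil {
		return err
	}
	for _, r := range rows {
		if _, err := stmt.Exec(r...); err != nil {
			return err
		}
	}
	// An Exec with no arguments flushes the buffered COPY data.
	if _, err := stmt.Exec(); err != nil {
		return err
	}
	if err := stmt.Close(); err != nil {
		return err
	}
	return txn.Commit()
}
```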
Yeah, we should explore possible solutions, but when it comes to bulk imports with […]. The solution proposed in this PR can add thousands of rows in a single query to a DB by using […].
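For contrast, the batch approach under discussion boils down to one multi-row INSERT; a rough sketch with the squirrel query builder (the import path, table, and column names are assumptions, not this PR's code):

```go
package ingest

import (
	"database/sql"

	sq "github.com/Masterminds/squirrel"
)

// insertBatch issues a single INSERT ... VALUES (...), (...), ... statement,
// one round trip for the whole batch of rows.
func insertBatch(db *sql.DB, table string, columns []string, rows [][]interface{}) error {
	builder := sq.Insert(table).
		Columns(columns...).
		PlaceholderFormat(sq.Dollar) // PostgreSQL-style $1, $2, ...
	for _, row := range rows {
		builder = builder.Values(row...)
	}
	query, args, err := builder.ToSql()
	if err != nil {
		return err
	}
	_, err = db.Exec(query, args...)
	return err
}
```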
OK, in e07a7d8 the in-memory cache has been removed completely and account IDs are now batch loaded/inserted once per session (check […]). Also improved the code for batch inserting.

Tests are broken now, will check them this week.
@nullstyle I checked the performance of […]. EDIT: Actually, query building can take some time when testing batch inserts, however […].
After changing the ingestion code to use […]:

Overall, the solution present in this PR (batch inserts) is slower than […]. I'm not comfortable with pushing this to production without proper, extensive testing, so I'm going to push the code I wrote today to another branch: […]. Talked with @nullstyle and he agrees.
So @bartekn, should I be reviewing this PR for inclusion into master, or do you have some more work you want to do before review?
I need to fix tests and want to move all tables in […].
In general, I'm not happy with how much code is in motion in this PR. The general flow of ingestion was already primed for batching, and I'm skeptical we need to make this scale of change to the code to get our performance up.
"type", | ||
"details", | ||
) | ||
func (ingest *Ingestion) createInsertBuilderByTableName(name TableName) { |
Keeping this function up to date is going to be a pain.
@@ -101,34 +143,28 @@ func (ingest *Ingestion) Ledger(
	header *core.LedgerHeader,
	txs int,
	ops int,
) error {

sql := ingest.ledgers.Values(
Let's use this hunk to express another concern I have with this code:

The approach you take for reifying table rows is unnecessary and produces a maintenance headache. Instead of replacing this code with something brand new, you should replace sq.InsertBuilder with a custom type named something similar to ingest.BatchInsertBuilder. This new type would still accept a call to Values as the original type does, but it simply doesn't return a SQL statement and we don't run ingest.DB.Exec(sql).

Does that make sense?
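A minimal sketch of the kind of type being suggested here; the struct shape and method signatures are assumptions about the proposal, not code from this PR:

```go
package ingest

import (
	"database/sql"
	"fmt"
	"strings"
)

// BatchInsertBuilder accepts Values calls the way sq.InsertBuilder does, but
// only queues the rows; nothing is sent to the database until Exec.
type BatchInsertBuilder struct {
	TableName string
	Columns   []string
	rows      [][]interface{}
}

// Values queues one row for later insertion.
func (b *BatchInsertBuilder) Values(values ...interface{}) {
	b.rows = append(b.rows, values)
}

// Exec flushes every queued row as a single multi-row INSERT.
func (b *BatchInsertBuilder) Exec(db *sql.DB) error {
	if len(b.rows) == 0 {
		return nil
	}
	var (
		groups []string
		args   []interface{}
	)
	n := 1
	for _, row := range b.rows {
		ph := make([]string, len(row))
		for i, v := range row {
			ph[i] = fmt.Sprintf("$%d", n)
			args = append(args, v)
			n++
		}
		groups = append(groups, "("+strings.Join(ph, ", ")+")")
	}
	query := fmt.Sprintf(
		"INSERT INTO %s (%s) VALUES %s",
		b.TableName, strings.Join(b.Columns, ", "), strings.Join(groups, ", "),
	)
	_, err := db.Exec(query, args...)
	b.rows = nil
	return err
}
```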
OK, this makes sense. But we still need some way to figure out which param is an address and needs to be replaced with an ID later. What do you think about creating a new type like:

type Address string

And we will be passing values like:

ingest.TransactionParticipantBatchInsertBuilder.Values(Address("G..."), param1, param2)

Then ingest.BatchInsertBuilder will have a method to return all of the addresses we need to get/create. Sounds good?
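Continuing the BatchInsertBuilder sketch above (still hypothetical, not this PR's code), the Address marker and the accompanying methods could look like this:

```go
package ingest

// Address marks a value that is a Stellar account address and must be swapped
// for its history_accounts row ID before the batch is executed.
type Address string

// Addresses returns every queued Address so the caller can get-or-create all
// of the corresponding rows in one pass.
func (b *BatchInsertBuilder) Addresses() []string {
	var out []string
	for _, row := range b.rows {
		for _, v := range row {
			if a, ok := v.(Address); ok {
				out = append(out, string(a))
			}
		}
	}
	return out
}

// ReplaceAddresses swaps each queued Address for its resolved ID using the
// provided address => id mapping.
func (b *BatchInsertBuilder) ReplaceAddresses(ids map[string]int64) {
	for _, row := range b.rows {
		for i, v := range row {
			if a, ok := v.(Address); ok {
				row[i] = ids[string(a)]
			}
		}
	}
}
```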
I'm cool with that approach. Sounds good!
GetParams() []interface{}
// UpdateAccountIDs updates fields with account IDs by using provided
// address => id mapping.
UpdateAccountIDs(accounts map[string]int64)
I don't agree with this approach to translating addresses into the synthetic account IDs used by the history system.

The cleaner approach is to add an additional phase to each round of ingestion:

1. Collect all accounts that are involved in a ledger and get-or-create the history_accounts rows as a batch, populating an in-memory map that phase 2 can refer to (see the sketch below).
2. The rest of ingestion.
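A rough sketch of what phase 1 could look like under that scheme; the helper name, the SQL, and the assumption of a unique index on history_accounts.address (plus PostgreSQL 9.5+ for ON CONFLICT) are mine, not this PR's code:

```go
package ingest

import (
	"database/sql"

	"github.com/lib/pq"
)

// getOrCreateAccountIDs inserts any missing history_accounts rows for the
// addresses touched by a ledger, then loads the full address => id mapping
// for phase 2 to consult in memory.
func getOrCreateAccountIDs(db *sql.DB, addresses []string) (map[string]int64, error) {
	// Create missing rows in one statement; existing rows are left alone.
	_, err := db.Exec(
		`INSERT INTO history_accounts (address)
		 SELECT unnest($1::text[])
		 ON CONFLICT (address) DO NOTHING`,
		pq.Array(addresses),
	)
	if err != nil {
		return nil, err
	}

	// Load IDs for every address in a single query.
	rows, err := db.Query(
		`SELECT id, address FROM history_accounts WHERE address = ANY($1)`,
		pq.Array(addresses),
	)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	ids := make(map[string]int64)
	for rows.Next() {
		var (
			id      int64
			address string
		)
		if err := rows.Scan(&id, &address); err != nil {
			return nil, err
		}
		ids[address] = id
	}
	return ids, rows.Err()
}
```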
I see. We can do this (A) or this (B). Which one is better for you? IMHO, […] seems to be a little better because we won't have to update a function collecting addresses if a new table is created, but both are OK to me.
Force-pushed from 41667b3 to 0009f1f.
Too much Git magic […]
Ingestion code is making too many DB requests:

- GetCreateAccountID is executed multiple times when checking the row ID for a given AccountID.

Because of so many requests, ingesting a single transaction with 100 payment operations takes 200-500 ms. It can grow to 20 seconds per ledger, causing delays in ingestion.
This PR introduces the following changes:

- Batched INSERTs to the database.
- Caching GetCreateAccountID results in memory. EDIT: Cache no longer used. See comments below.

After implementing these changes, a ledger with 50 txs / 5000 operations is ingested in 3-5 seconds, and a ledger with 500 operations is ingested in 0.5 seconds.
However, this is not ready to be merged. This needs to be addressed before merging:

- Queries can exceed PostgreSQL's bound-parameter limit (ERRO[0078] import session failed: exec failed: pq: got 100000 parameters but PostgreSQL only supports 65535 parameters pid=15551). Fixed in 98ddb01; a chunking sketch follows below.
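A sketch of the chunking idea for staying under the 65535-parameter limit; this is the general shape of such a fix, not necessarily what 98ddb01 does:

```go
package ingest

// maxPostgresParams is the hard limit on bound parameters per statement.
const maxPostgresParams = 65535

// chunkRows splits a batch so that each chunk uses at most maxParams bound
// parameters, given columnCount parameters per row.
func chunkRows(rows [][]interface{}, columnCount, maxParams int) [][][]interface{} {
	if columnCount == 0 || len(rows) == 0 {
		return nil
	}
	rowsPerChunk := maxParams / columnCount
	var chunks [][][]interface{}
	for start := 0; start < len(rows); start += rowsPerChunk {
		end := start + rowsPerChunk
		if end > len(rows) {
			end = len(rows)
		}
		chunks = append(chunks, rows[start:end])
	}
	return chunks
}
```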
Other thoughts:

- We need to consider how quickly the cache can grow and what the memory requirements are. Cache no longer used.
- The cache makes sense when operations involve a small set of accounts; otherwise GetCreateAccountID may still be a bottleneck. Currently I can't find a better solution than a DB schema redesign (so accounts are identified by AccountID instead of the primary key in the accounts table). Or maybe cache warmup on horizon init? Cache no longer used.