services/horizon: Precompute 1m Trade Aggregation buckets #3641
Conversation
For reading this, I'd recommend starting with the migration, as it'll help make sense of the rest.
It's looking good. Looking forward to some response-duration stats. My only worry is reingestion in ranges: in that case ranges can be ingested in non-increasing order, so the triggers here will calculate values incorrectly.
I think we can fix this, but it requires adding an extra step once reingestion is done:
- We move the trigger to Go code and run it after a ledger is ingested using normal operations (not reingestion).
- In the reingestion code we add an extra step, run once all ranges are completed, which is responsible for calculating the 1m resolutions.
services/horizon/internal/db2/schema/migrations/47_precompute_trade_aggregations.sql
Idea to fix: Please note that the open/close price will still be incorrect if the bucket was ingested partially.
This seems like a good plan. We actually only need to completely recalculate the first and last buckets, as those are the only ones which would be only partially ingested.
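A minimal sketch of what completely recalculating one of those edge buckets could look like, assuming a history_trades_60000 bucket table, simplified price/volume column names, and a ledger_closed_at column on history_trades (all illustrative assumptions, not the actual Horizon schema):

-- Sketch: fully rebuild a single 1m bucket ($3 = bucket start in ms) from the
-- raw trades. Column names are illustrative, not the actual Horizon schema.
DELETE FROM history_trades_60000
 WHERE base_asset_id = $1 AND counter_asset_id = $2 AND timestamp = $3;

INSERT INTO history_trades_60000
  (base_asset_id, counter_asset_id, timestamp,
   trade_count, base_volume, counter_volume, high, low, open_price, close_price)
SELECT
  $1, $2, $3,
  count(*),
  sum(base_amount),
  sum(counter_amount),
  max(price),
  min(price),
  (array_agg(price ORDER BY ledger_closed_at ASC ))[1],  -- open: first trade in the bucket
  (array_agg(price ORDER BY ledger_closed_at DESC))[1]   -- close: last trade in the bucket
FROM history_trades
WHERE base_asset_id = $1
  AND counter_asset_id = $2
  AND ledger_closed_at >= to_timestamp($3 / 1000)
  AND ledger_closed_at <  to_timestamp(($3 / 1000) + 60)
HAVING count(*) > 0;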
OK, so benchmarking on my local laptop. Let's find the biggest, most-traded pairs, to set a worst-case bound...

Top asset-pairs query

Pre-computed 1m buckets query: ~900ms
On-the-fly 1d buckets query: ~1300ms

Thoughts

Keep in mind these are worst cases (loading all of time for the most-traded asset pair). It seems like most of the time is spent unwinding the jsonb. Intuitively I feel like that could be reduced, as it seems like the limit param isn't being propagated down as far as it could be. Maybe I'm missing an index I had before? 🤔

For comparison with the "normal" case, with counter asset

It also might be worth doing a benchmark without the jsonb storage. It roughly doubles the data storage, but might make the query times much happier.

Update: benchmarked without the jsonb storage.

Non-jsonb pre-computed 1m buckets query: ~30ms
Non-jsonb on-the-fly 1d buckets query: ~1000ms
Non-jsonb inserts: 1-15ms

So, expectedly, the raw non-jsonb retrieval is blazing fast. Interestingly, the on-the-fly query is still quite slow, since it is dominated by the GroupAggregate work. We could use the same trigger (more or less) to pre-compute the 5- and 15-minute buckets as well, which would probably help, but this adds ~1300MB. Basically, we need to decide whether the extra disk usage (1600MB vs 500MB with jsonb) is worth the increased query speed (or find another way to reduce the disk usage, maybe a side table for the asset pair, or tuning the field types to reduce row width).
Thinking about this more, and benchmarking a bit more, I think we should not do the jsonb thing for now. The faster query times will take more load off the DB, and hard drives are cheap(ish). Let's opt for the ~1600MB disk usage and the fastest query times. Because all the data is available, if/when disk usage later becomes an issue we can reconsider a move to jsonb (via a DB migration, no reingest needed), with more knowledge about the usage patterns.
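For reference, aggregating a higher resolution (say 1h) on the fly from the non-jsonb 1m bucket table could look roughly like the sketch below; the column names and bucketing arithmetic are illustrative assumptions, not the exact Horizon query:

-- Sketch: roll precomputed 1m buckets up into 1h buckets on the fly.
-- Column names are illustrative, not the exact Horizon schema or query.
SELECT
  (timestamp / 3600000) * 3600000                      AS timestamp,     -- 1h bucket start (ms)
  sum(trade_count)                                     AS trade_count,
  sum(base_volume)                                     AS base_volume,
  sum(counter_volume)                                  AS counter_volume,
  max(high)                                            AS high,
  min(low)                                             AS low,
  (array_agg(open_price  ORDER BY timestamp ASC ))[1]  AS open_price,    -- open of earliest 1m bucket
  (array_agg(close_price ORDER BY timestamp DESC))[1]  AS close_price    -- close of latest 1m bucket
FROM history_trades_60000
WHERE base_asset_id = $1 AND counter_asset_id = $2
GROUP BY 1
ORDER BY 1
LIMIT $3;

This is essentially the GroupAggregate work mentioned above, just over far fewer rows than the raw history_trades table.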
- Just calls the ~same thing once per insert batch from Go
- Unknowns:
  - Does it actually work?
  - Is the "in" query for buckets efficient, or would a range be better?
  - Is a full rebuild of buckets efficient, or should we do it incrementally? Seems safer if it works.
@@ -753,6 +760,11 @@ func (h reingestHistoryRangeState) run(s *system) (transition, error) {
		}
	}

	err = s.historyQ.RebuildTradeAggregationBuckets(s.ctx, h.fromLedger, h.toLedger)
Note, running this once at the end of the reingestion means that the aggregations will be out-of-date until the reingestion finishes.
We could decide we're fine with that, or do some detection and rebuild it after each minute chunk. Running it once at the end is way more efficient though, as it can just load the whole range of trades at once (instead of loading each minute's).
Per benchmarking with @jacekn it seems like the previous slowness was caused by missing an index to use for the DeleteRangeAll call. If we're concerned with consistency during reingestion we can retry rebuilding the buckets after each ledger, and time that.
OK, it looks good to me! I added one major comment about the delete part of RebuildTradeAggregationBuckets, and a few minor comments. I know you checked the speed of the new thing (which looks really good!), but have you checked the correctness? I think what we could do is: deploy your branch to stg, rebuild the buckets, grab as many /trade_aggregation requests as we can from the access log, and use horizon-cmp to check whether the stable release returns the same data.
FROM htrd
GROUP BY base_asset_id, counter_asset_id, timestamp
ORDER BY timestamp
);
I wonder if we should remove the recalculation from the DB migration. Not everyone is using trade aggregations, and not everyone can afford 9m of downtime (which I guess can be even slower on lower-spec machines/disks). What about leaving the query that creates the table here, but recalculating only when someone explicitly runs `horizon db reingest ...`? There are obviously disadvantages to this, so I'd rather discuss it first.
Would this actually need to cause downtime? I guess so, as the table locks would conflict with ingestion... 🤔
After actually deploying this and using it, I think I agree. Having the migrations take a long time is a pain, as is maintaining the query in 2 places. Let's remove it from the migration.
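For illustration, the migration could then shrink to just creating the empty bucket table, along the lines of the sketch below (the column layout is an assumed, simplified OHLC shape, not the exact schema in the actual migration), with the backfill left to the rebuild step triggered by `horizon db reingest`:

-- Sketch: the migration only creates the empty 1m bucket table; the backfill
-- is done later by the rebuild step. Column layout is illustrative.
CREATE TABLE history_trades_60000 (
    timestamp        bigint  NOT NULL,   -- bucket start, milliseconds since epoch
    base_asset_id    bigint  NOT NULL,
    counter_asset_id bigint  NOT NULL,
    trade_count      integer NOT NULL,
    base_volume      numeric NOT NULL,
    counter_volume   numeric NOT NULL,
    high             numeric NOT NULL,
    low              numeric NOT NULL,
    open_price       numeric NOT NULL,
    close_price      numeric NOT NULL,
    open_ledger      integer NOT NULL,   -- lets reingestion scope deletes to a ledger range
    PRIMARY KEY (base_asset_id, counter_asset_id, timestamp)
);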
_, err := q.Exec(ctx, sq.Delete("history_trades_60000").Where(
	sq.GtOrEq{"open_ledger": fromLedger},
	sq.Lt{"open_ledger": toLedger},
))
I think something is wrong here, or I misunderstand what `history_trades_60000` holds. Given that `history_trades_60000` holds one-minute trade aggregation buckets, check this chart:

---|-------------------|----------
   O  |                C  |
      |                   |
      X                   Y

The line at the top is one of the already-aggregated buckets in the DB: it starts at ledger O (open) and ends at C (close). Then we run a reingest of ledgers X-Y. What we want is to delete the buckets starting at O and at C, and I think the highlighted code doesn't do that.
You are right, yeah. We actually needed to find the first and last bucket timestamps from the given ledger range, not the actual ledger values. Otherwise we would only include partial data when rebuilding. Nice catch, thanks.
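As a rough sketch of that idea (assuming a history_ledgers table with sequence and closed_at columns and millisecond bucket timestamps; the bucketing expression is an assumption, not the exact Horizon query), the delete could be bounded by the bucket timestamps covering the reingested ledger range instead of by open_ledger:

-- Sketch: find the first and last 1m bucket timestamps touched by the
-- reingested ledger range, then delete whole buckets by timestamp so the
-- partially-covered edge buckets get rebuilt in full.
WITH bounds AS (
  SELECT
    (min(extract(epoch FROM closed_at))::bigint / 60) * 60000 AS first_bucket,
    (max(extract(epoch FROM closed_at))::bigint / 60) * 60000 AS last_bucket
  FROM history_ledgers
  WHERE sequence >= $1 AND sequence < $2   -- [fromLedger, toLedger)
)
DELETE FROM history_trades_60000 h
USING bounds b
WHERE h.timestamp >= b.first_bucket
  AND h.timestamp <= b.last_bucket;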
Note, when ingesting this failed with:
So, we are not clearing out the correct old buckets.
PR Checklist

PR Structure
- This PR avoids mixing refactoring changes with feature changes (split into two PRs otherwise).
- This PR's title starts with the name of the package that is most changed in the PR, ex. `services/friendbot`, or `all` or `doc` if the changes are broad or impact many packages.

Thoroughness
- I've updated any docs (developer docs, `.md` files, etc... affected by this change). Take a look in the `docs` folder for a given service, like this one.

Release planning
- I've updated the relevant CHANGELOG (here for Horizon) if needed with deprecations, added features, breaking changes, and DB schema changes.
- I've decided if this PR requires a new major/minor version according to semver, or if it's mainly a patch change. The PR is targeted at the next release branch if it's not a patch change.
What
Incremental pre-computation of the 1-minute trade-aggregation buckets with a database trigger. Higher-resolution queries are computed on the fly.
Why
Fixes #632
Per discussion, pre-computing seems the best option for performance and scalability, as the amount of computation is known and bounded. The concern is disk usage. Given a raw `history_trades` table of 4GB, a naive 1m bucket pre-computation results in a new table of approximately 1.6GB. Not outrageous, but it could grow quickly if more trading pairs become active. When generating that as a materialized view, I accidentally forgot to include the `base_asset_id` and `counter_asset_id`, and noticed the table was much smaller. To compress the table, this PR uses wide jsonb columns, so there is one row per year/month/asset-pair, with a `values` column containing the month's 1m buckets. This adds complexity and a small overhead on querying (~15%), but saves a further ~70% on disk size.

In order to handle the `offset` param, the maximum resolution we could potentially pre-compute is 15 minutes. For this PR, I've only pre-computed the 1-minute buckets and aggregated higher resolutions on the fly. In my past experience this has been enough, as the 1m aggregation gives higher resolutions a big boost, and they are quite quick to aggregate on the fly. (TODO: benchmark this for concrete numbers)

Initially I hoped we could just use a materialized view to do the pre-computation (to avoid changes to Horizon ingestion code). However, updating materialized views is all-or-nothing, and takes ~7 minutes on my laptop. Instead, this PR uses a trigger to incrementally update the relevant buckets on insert. This implementation is still quite fragile, in that it doesn't handle `UPDATE` or `DELETE` queries, and assumes all trades are inserted in order. A non-incremental recompute of a 1m bucket takes ~300ms; I didn't want to add that overhead to each insert, but it might be an easy way to handle update/delete queries.
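For illustration, a simplified sketch of this kind of insert trigger (the bucketing arithmetic and column names are assumptions for a plain OHLC layout, and `history_trades` is assumed to expose price, base_amount, counter_amount, and ledger_closed_at columns; this is not the actual SQL in the migration):

-- Sketch: maintain the 1m buckets incrementally on each trade insert
-- (illustrative columns; assumes in-order inserts and PostgreSQL 11+).
CREATE OR REPLACE FUNCTION htrd_60000_on_insert() RETURNS trigger AS $$
DECLARE
  bucket bigint := (extract(epoch FROM NEW.ledger_closed_at)::bigint / 60) * 60000;
BEGIN
  INSERT INTO history_trades_60000 AS agg
    (base_asset_id, counter_asset_id, timestamp,
     trade_count, base_volume, counter_volume, high, low, open_price, close_price)
  VALUES
    (NEW.base_asset_id, NEW.counter_asset_id, bucket,
     1, NEW.base_amount, NEW.counter_amount, NEW.price, NEW.price, NEW.price, NEW.price)
  ON CONFLICT (base_asset_id, counter_asset_id, timestamp) DO UPDATE SET
    trade_count    = agg.trade_count + 1,
    base_volume    = agg.base_volume + EXCLUDED.base_volume,
    counter_volume = agg.counter_volume + EXCLUDED.counter_volume,
    high           = greatest(agg.high, EXCLUDED.high),
    low            = least(agg.low, EXCLUDED.low),
    -- open is left untouched and close is overwritten, which is why this
    -- approach assumes in-order inserts and no UPDATE/DELETE.
    close_price    = EXCLUDED.close_price;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER htrd_60000_insert
  AFTER INSERT ON history_trades
  FOR EACH ROW EXECUTE FUNCTION htrd_60000_on_insert();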
Known limitations

This trigger is still quite fragile, in that it doesn't handle `UPDATE` or `DELETE` queries, and assumes all trades are inserted in order.

The reduced query load on `history_trades` means we could probably get rid of some of the now-unused indexes on that table, which would save a few gigabytes of disk.

TODO
Later