services/horizon/ingest: RebuildTradeAggregationBuckets cannot be invoked concurrently during parallel ingestion #5127
Comments
The suggested solution will work if one reingestion job runs parallel workers on a single host machine against a single Horizon DB. However, if two reingestion jobs run on separate host machines, each without parallel workers but on ledger ranges adjacent to each other, then the same trade agg rebuild error can still occur on one of the jobs, whichever one tries to commit to the DB last.
The trade agg rebuild issue was observed on distributed machines with adjacent ranges during performance benchmarking on the staging environment with the 2.28.0 release candidate. @mollykarcher, rather than trying to patch this at the various levels and adding complexity, there appear to be two options to take:
@sreuland was this issue introduced with the changes in 2.28, or did it always exist? I know it was called out / referenced here but the wording (

As an aside, is there a good reason that the trade aggregation buckets need to be aligned/bucketed by minutes instead of by ledger? I would think we could achieve the same thing if we bucketed by ledger and did the to-minute (or whatever unit of real time) conversion on the user request side, and that would also alleviate this issue. I understand this might be more involved (and potentially break the interface), so I'm not sure it's something we'd want to tackle right now.

Of the two options you presented, I think 1 is the significantly better choice. I think it would actually be nice to have support for trade aggregations be an explicit/intentional opt-in by the Horizon operator (and I actually think not many operators would want to use it). I do worry about people being really confused by this though, given it's a new step we're introducing and they're likely to forget/skip it. Would there be any way that we could be aware of this state (trade-aggs not properly rebuilt yet) after a reingest and either disable /trade-aggs or return some error code from it, rather than junk data?
@mollykarcher, the trade agg bucket rebuild issue with parallel workers was pre-existing prior to the 2.28 ingestion perf overhaul, based on seeing the same error condition happen when running

I think these may be complementary iterations, and propose moving forward with:

First pass (this ticket): fix the parallel workers as suggested, so that the trade agg rebuild is done once at the end rather than by each worker. This will go into the 2.28.0 release.

Second pass: I created a new ticket, #5169, on trade agg refactor options, to capture your suggestion above of changing the alignment to lower the complexity for processes that use it. This would enable parallelism for trade agg functionality, such as running multiple rebuilds with adjacent ledger ranges on the same machine or distributed. This would be planned for a later Horizon release.
merged #5168
What version are you using?
build from latest on master
What did you do?
Run parallel reingestion jobs,
What did you expect to see?
RebuildTradeAggregationBuckets() should work without error given parallel workers.
What did you see instead?
RebuildTradeAggregationBuckets() cannot be invoked concurrently during parallel ingestion because there will be duplicate key constraint errors when two workers invoke the function on adjacent buckets. That is because the buckets occur on minute boundaries and two adjacent ledger ranges will share the same trade aggregations bucket.
This can potentially be fixed by modifying parallel reingestion so that the trade aggregation buckets are built once all the workers have completed their ingestion jobs.
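A rough sketch of that approach in Go might look like the following. The `ledgerRange` type, `reingestAll`, and the `ingest` and `rebuild` callbacks are hypothetical placeholders for illustration, not the actual Horizon implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// ledgerRange is a hypothetical inclusive range of ledger sequences.
type ledgerRange struct{ from, to uint32 }

// reingestAll runs ingest for every range concurrently, waits for all
// workers to finish, and only then calls rebuild once over the full span.
// ingest and rebuild are placeholders, not Horizon functions.
func reingestAll(
	ranges []ledgerRange,
	ingest func(ledgerRange) error,
	rebuild func(from, to uint32) error,
) error {
	if len(ranges) == 0 {
		return nil
	}
	errs := make([]error, len(ranges))
	var wg sync.WaitGroup
	for i, r := range ranges {
		wg.Add(1)
		go func(i int, r ledgerRange) {
			defer wg.Done()
			errs[i] = ingest(r) // workers ingest only; no bucket rebuild here
		}(i, r)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	// Compute the overall span covered by all workers.
	from, to := ranges[0].from, ranges[0].to
	for _, r := range ranges[1:] {
		if r.from < from {
			from = r.from
		}
		if r.to > to {
			to = r.to
		}
	}
	// Single rebuild, so no two callers touch the same bucket concurrently.
	return rebuild(from, to)
}

func main() {
	ranges := []ledgerRange{{1, 100}, {101, 200}}
	err := reingestAll(ranges,
		func(ledgerRange) error { return nil },
		func(from, to uint32) error {
			fmt.Printf("rebuilding buckets for ledgers %d-%d\n", from, to)
			return nil
		})
	fmt.Println("error:", err)
}
```

This only serializes the rebuild within one process. It does not coordinate separate reingestion jobs that run on different hosts against the same DB.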
related to #5099