Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

services/horizon/ingest: RebuildTradeAggregationBuckets cannot be invoked concurrently during parallel ingestion #5127

Closed
sreuland opened this issue Nov 28, 2023 · 4 comments
Assignees
Labels
bug ingest New ingestion system performance issues aimed at improving performance

Comments

@sreuland
Copy link
Contributor

What version are you using?

build from latest on master

What did you do?

Run parallel reingestion jobs,

What did you expect to see?

RebuildTradeAggregationBuckets() should work without error given parallel workers.

What did you see instead?

RebuildTradeAggregationBuckets() cannot be invoked concurrently during parallel ingestion because there will be duplicate key constraint errors when two workers invoke the function on adjacent buckets. That is because the buckets occur on minute boundaries and two adjacent ledger ranges will share the same trade aggregations bucket.

This can potentially be fixed by modifying parallel reingestion so that the trade aggregation buckets are built once all the workers have completed their ingestion jobs.

related to #5099

@sreuland sreuland added ingest New ingestion system bug performance issues aimed at improving performance labels Nov 28, 2023
@sreuland sreuland changed the title RebuildTradeAggregationBuckets cannot be invoked concurrently during parallel ingestion services/horizon/ingest: RebuildTradeAggregationBuckets cannot be invoked concurrently during parallel ingestion Nov 28, 2023
@mollykarcher mollykarcher moved this from Backlog to Next Sprint Proposal in Platform Scrum Jan 2, 2024
@mollykarcher mollykarcher moved this from Next Sprint Proposal to Current Sprint in Platform Scrum Jan 5, 2024
@sreuland sreuland moved this from Current Sprint to In Progress in Platform Scrum Jan 11, 2024
@sreuland
Copy link
Contributor Author

sreuland commented Jan 13, 2024

This can potentially be fixed by modifying parallel reingestion so that the trade aggregation buckets are built once all the workers have completed their ingestion jobs.

the suggested solution will work if one reingestion job is running parallel workers on a single host machine against the one horizon db. However, if two reingestion jobs are running, each job running on a separate host machine, and the jobs are not using parallel workers but running adjacent ledger ranges to each other, than the same error related to trade agg rebuild can still occur on one of the jobs, whichever one tries to commit to the db last.

error="Error rebuilding trade aggregations: could not rebuild trade aggregation bucket: exec failed: pq: duplicate key value violates unique constraint \"history_trades_60000_pkey\ 

the trade agg rebuild issue was observed on distributed machines with adjacent ranges during performance benchmarking on staging environment with the 2.28.0 release candidate:

https://docs.google.com/document/d/1hxWY2i41o1b6_wcEyyXlzNt1msFpKiRuhKEI0jyEu50/edit#heading=h.1fe0zb4s2ye

@mollykarcher ,rather than trying to patch this at the various levels, adding complexity, there appears to be two options to take:

  1. lower complexity by just removing the trade aggregation bucket rebuild attempt from any horizon db reingest range, and add a new sub-command horizon ingest trigger-tradeagg-rebuild which follows existing similar pattern of user initiated sub command to rebuild state horizon ingest trigger-state-rebuild

  2. do the parallel ingestion fix as suggested which will fix single host re-ingestion and then add some more docs to the horizon db reingest range sub-command stating not to run the command with adjacent ledger ranges concurrently.

@sreuland sreuland self-assigned this Jan 16, 2024
@mollykarcher
Copy link
Contributor

@sreuland was this issue introduced with the changes on 2.28 or did this issue always exist? I know it was called out / referenced here but the wording (....bugs will need to be addressed when implementing this issue) makes it sound as if these were already present issues.

As an aside, is there a good reason that the trade aggregation buckets need to be aligned/bucketed by minutes instead of by ledger? I would think we could achieve the same thing if we bucketed by ledger, and do the to-minute (or whatever unit of real time) conversion on the user request side, and that would also alleviate this issue. I understand this might be more involved (and potentially break the interface) and so not sure it's something we'd want to tackle right now.

Of the two options you presented, I think 1 is the significantly better choice. I think it would actually be nice to have support for trade aggregations be an explicit/intentional opt-in by the Horizon operator (and I actually think not many operators would want to use it). I do worry about people being really confused by this though, given it's a new step we're introducing and they're likely to forgot/skip it. Would there be any way that we could be aware of this state (trade-aggs not properly rebuilt yet) after a reingest and either disable or return some error code on /trade-aggs, rather than junk data?

@sreuland
Copy link
Contributor Author

sreuland commented Jan 17, 2024

@mollykarcher , The trade agg bucket rebuild issue with parallel workers was pre-existing prior to 2.28/ingestion perf overhaul, based on seeing the same error condition happen when running db reingest range with 2.27.0.

I think, these may be complementary iterations, and propose moving forward with:

first pass, this ticket, fix the parallel workers as suggested to do trade agg rebuild once at end rather than each worker doing a trade agg rebuild, this will go to 2.28.0 release

second pass, created a new ticket, #5169, trade agg refactor options to capture your suggestion of changing the alignment to lower complexity for processes to use it above, this will enable parallelism in terms of tradeagg functionality, such as now can run multiple rebuilds with adjacent ledger ranges on same machine or distributed. this would be planned to go in later horizon release.

sreuland added a commit to sreuland/go that referenced this issue Jan 17, 2024
sreuland added a commit to sreuland/go that referenced this issue Jan 17, 2024
sreuland added a commit to sreuland/go that referenced this issue Jan 17, 2024
@sreuland sreuland moved this from In Progress to Needs Review in Platform Scrum Jan 17, 2024
@sreuland
Copy link
Contributor Author

merged #5168

@github-project-automation github-project-automation bot moved this from Needs Review to Done in Platform Scrum Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug ingest New ingestion system performance issues aimed at improving performance
Projects
Status: Done
Development

No branches or pull requests

2 participants