
services/horizon: Remove --parallel-job-size config parameter #5468

Closed
urvisavla opened this issue Sep 20, 2024 · 7 comments · Fixed by #5484
Comments

Contributor

urvisavla commented Sep 20, 2024

  • Remove the --parallel-job-size config parameter and set the value to 0 internally. The optimal job size should be calculated internally based on the total range and the number of workers specified; see:
    func calculateParallelLedgerBatchSize(rangeSize uint32, batchSizeSuggestion uint32, workerCount uint) uint32 {
  • Update the changelog.
  • Update public docs if necessary.
  • Ensure the buffer_size parameter does not exceed the job size.
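The internal calculation described in the first task above could look roughly like the following. This is a sketch only, not the actual horizon implementation: the minimum batch size, the checkpoint interval of 64, and the rounding strategy are assumptions for illustration.

```go
package main

import "fmt"

// Assumed constants for illustration; the real values live in the
// horizon ingest code and may differ.
const historyCheckpointLedgerInterval = 64
const minBatchSize = historyCheckpointLedgerInterval

// calculateParallelLedgerBatchSize sketches how a job size could be
// derived internally from the total range and the worker count once the
// user-facing --parallel-job-size parameter is removed.
func calculateParallelLedgerBatchSize(rangeSize uint32, workerCount uint) uint32 {
	if workerCount == 0 {
		workerCount = 1
	}
	// Split the range evenly across workers.
	batchSize := rangeSize / uint32(workerCount)
	// Keep batches large enough to amortize per-job overhead.
	if batchSize < minBatchSize {
		batchSize = minBatchSize
	}
	// Round down to a checkpoint-aligned multiple of 64 ledgers.
	return batchSize / historyCheckpointLedgerInterval * historyCheckpointLedgerInterval
}

func main() {
	fmt.Println(calculateParallelLedgerBatchSize(100000, 4)) // 24960
	fmt.Println(calculateParallelLedgerBatchSize(100, 4))    // 64
}
```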

This is a breaking change, so it would be good to include it in the upcoming protocol 22 release.


Original Description (Keeping it for context):
TL;DR
For reingestion using the BufferedStorageBackend, the ParallelJobSize
parameter is currently set to a default of 100. This creates a conflict with the buffer_size parameter in the BufferedStorageBackend configuration. To fix this, we should either remove the ParallelJobSize parameter for reingestion using BufferedStorageBackend or set it to a larger value by default.

More info:
The ParallelJobSize parameter determines how many ledgers are processed in each task during parallel ingestion. It is set to 100k by default for Captive Core ingestion. This number was determined based on how many ledgers Captive Core can efficiently replay at once. However, when using the BufferedStorageBackend for reingestion, ParallelJobSize doesn’t really matter because the BufferedStorageBackend is stateless and isn’t affected by the number of ledgers assigned to it.

The BufferedStorageBackend has a buffer_size config parameter, which determines how many files are kept in memory. Setting ParallelJobSize to 100 limits the usefulness of buffer_size: for example, even if buffer_size is set to 1000, each task will still only process 100 ledgers due to the ParallelJobSize setting.

While ParallelJobSize doesn’t directly impact performance with the BufferedStorageBackend, it could be helpful to set it to a larger value, like 100k, for more efficient reingestion of large ledger ranges. When reingesting a larger ledger range, the older ledgers are smaller and faster to process, which means some workers might finish their tasks faster than others. By dividing the workload into smaller tasks, those faster workers can take on additional work, resulting in better efficiency.
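The load-balancing point above can be sketched as follows: splitting a range into fixed-size jobs produces a work queue, so workers that finish early can pick up the next job. The function name and numbers are illustrative only, not horizon code.

```go
package main

import "fmt"

// splitRange divides an inclusive ledger range [from, to] into jobs of at
// most jobSize ledgers each. Workers that finish a job early can take the
// next one from the resulting list.
func splitRange(from, to, jobSize uint32) [][2]uint32 {
	var jobs [][2]uint32
	for start := from; start <= to; start += jobSize {
		end := start + jobSize - 1
		if end > to {
			end = to
		}
		jobs = append(jobs, [2]uint32{start, end})
	}
	return jobs
}

func main() {
	// A 250-ledger range split into 100-ledger jobs yields 3 jobs.
	fmt.Println(splitRange(1, 250, 100)) // [[1 100] [101 200] [201 250]]
}
```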

Contributor

tamirms commented Sep 20, 2024

@urvisavla does the buffer_size config parameter have a default value?

You mentioned:

To fix this, we should either remove the ParallelJobSize parameter for reingestion using BufferedStorageBackend or set it to a larger value by default.

What about the option of overriding buffer_size so that it is equal to ParallelJobSize?

Contributor Author

urvisavla commented Sep 22, 2024

@urvisavla does the buffer_size config parameter have a default value?

Not currently but Shawn's PR adds default values based on the datastore schema #5462

You mentioned:

To fix this, we should either remove the ParallelJobSize parameter for reingestion using BufferedStorageBackend or set it to a larger value by default.

What about the option of overriding buffer_size so that it is equal to ParallelJobSize?

Currently, if buffer_size is set above 100, it is effectively overridden by ParallelJobSize (=100) because each task will process a range of 100 (ParallelJobSize) ledgers. The Ledger Buffer never queues ledgers past the end of the specified range.

Explaining the interaction between ParallelJobSize and buffer_size will be challenging and most of the time reingestion performance using BufferedStorageBackend will depend on setting the right buffer_size.

The best solution, imo, would be to deprecate ParallelJobSize entirely. I doubt many users know how to use it effectively even for Captive Core. We can set ParallelJobSize to 100,000 by default for both Captive Core and BufferedStorageBackend reingestion. This way, users only need to focus on tuning two things: 1) buffer_size and num_workers (download) based on their data lake schema, and 2) the number of parallel reingestion workers depending on their hardware; beefier machines can handle more parallel reingestion.

@sreuland @mollykarcher

@urvisavla urvisavla moved this from Backlog to To Do in Platform Scrum Sep 24, 2024
Contributor

tamirms commented Sep 24, 2024

Currently, if buffer_size is set above 100, it is effectively overridden by ParallelJobSize (=100) because each task will process a range of 100 (ParallelJobSize) ledgers. The Ledger Buffer never queues ledgers past the end of the specified range.

ok, I understand now, thanks for the explanation!

We can set ParallelJobSize to 100,000 by default for both Captive Core and BufferedStorageBackend reingestion.

I looked through the code and it seems difficult to determine the effect of the ParallelJobSize parameter because it turns out the parameter is just a suggestion. We also have a MaxLedgerPerFlush configuration parameter which further complicates the picture. So, I think we need to rethink ParallelJobSize and try to simplify how the value is derived.

But I don't think the buffer_size in the BufferedStorageBackend toml file should influence the batch size of each reingestion worker. IMO it makes more sense if the buffer_size in the BufferedStorageBackend config is set to the minimum of the value present in the toml file and the batch size of each reingestion worker.
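The explicit minimum proposed above could be applied when the backend config is built. This is a sketch under assumed names: the struct and function below are hypothetical, not the actual horizon config types.

```go
package main

import "fmt"

// bufferedStorageConfig is a hypothetical stand-in for the
// BufferedStorageBackend toml config; the real struct has more fields
// and may use different names.
type bufferedStorageConfig struct {
	BufferSize uint32
	NumWorkers uint32
}

// resolveBufferSize makes the implicit cap explicit: the effective
// buffer_size is the minimum of the toml value and the per-worker
// reingestion batch size.
func resolveBufferSize(cfg bufferedStorageConfig, batchSize uint32) uint32 {
	if cfg.BufferSize > batchSize {
		return batchSize
	}
	return cfg.BufferSize
}

func main() {
	cfg := bufferedStorageConfig{BufferSize: 1000, NumWorkers: 10}
	// With a batch size of 100 per worker, a buffer_size of 1000 is
	// capped at 100.
	fmt.Println(resolveBufferSize(cfg, 100))
}
```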

@tamirms tamirms added this to the platform sprint 51 milestone Sep 24, 2024
@urvisavla
Contributor Author

I looked through the code and it seems difficult to determine the effect of the ParallelJobSize parameter because it turns out the parameter is just a suggestion. We also have a MaxLedgerPerFlush configuration parameter which further complicates the picture. So, I think we need to rethink ParallelJobSize and try to simplify how the value is derived.

Yes, I was suggesting that the parallel job size should not be user configurable; it's something we can set to a fixed value or as you mentioned calculate internally.

But, I don't think the buffer_size in the BufferedStorageBackend toml file should influence the batch size of each reingestion worker.

I'm not suggesting that the buffer_size should influence the batch size, but that the reverse is true.

IMO it makes more sense if the buffer_size in the BufferedStorageBackend config is set to the minimum of the value present in the toml file and the batch size of each reingestion worker.

Agreed, this is essentially the implicit effect, but we can make it explicit.

Contributor

tamirms commented Sep 25, 2024

Yes, I was suggesting that the parallel job size should not be user configurable; it's something we can set to a fixed value or as you mentioned calculate internally.

sounds good! MaxLedgerPerFlush will ensure that ingestion does not consume too much memory, so it should be safe to derive ParallelJobSize from the size of the total ledger range and the number of ingestion workers. I am on board with your plan to deprecate the ParallelJobSize parameter.

@sreuland
Contributor

+1 to deprecate the --parallel-job-size cli param in favor of internal optimization, less external config, bonus!

@urvisavla urvisavla moved this from Backlog to To Do in Platform Scrum Oct 1, 2024
@urvisavla urvisavla changed the title services/horizon: adjust ParallelJobSize parameter when reingesting from BufferedStorageBackend (datastore) services/horizon: Remove --paralleljobsize config parameter Oct 1, 2024
@urvisavla
Contributor Author

Thanks @tamirms @sreuland. I've updated the description and changed the scope from a bug to a feature.

@urvisavla urvisavla self-assigned this Oct 1, 2024
@urvisavla urvisavla changed the title services/horizon: Remove --paralleljobsize config parameter services/horizon: Remove --parallel-job-size config parameter Oct 3, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Platform Scrum Oct 14, 2024