
services/horizon: Remove --parallel-job-size config parameter #5468

Closed
urvisavla opened this issue Sep 20, 2024 · 7 comments · Fixed by #5484
Comments

Contributor

urvisavla commented Sep 20, 2024

  • Remove the --parallel-job-size config parameter and set the value to 0 internally. The optimal job size should be calculated internally based on the total range and the number of workers specified; see:
    func calculateParallelLedgerBatchSize(rangeSize uint32, batchSizeSuggestion uint32, workerCount uint) uint32 {
  • Update the changelog.
  • Update public docs if necessary.
  • Ensure the buffer_size parameter does not exceed the job size.
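The internal calculation described in the first task above could look roughly like the following. This is a sketch only, not the actual horizon implementation: the minimum batch size, the checkpoint interval of 64, and the rounding strategy are assumptions for illustration.

```go
package main

import "fmt"

// Assumed constants for illustration; the real values live in the
// horizon ingest code and may differ.
const historyCheckpointLedgerInterval = 64
const minBatchSize = historyCheckpointLedgerInterval

// calculateParallelLedgerBatchSize sketches how a job size could be
// derived internally from the total range and the worker count once the
// user-facing --parallel-job-size parameter is removed.
func calculateParallelLedgerBatchSize(rangeSize uint32, workerCount uint) uint32 {
	if workerCount == 0 {
		workerCount = 1
	}
	// Split the range evenly across workers.
	batchSize := rangeSize / uint32(workerCount)
	// Keep batches large enough to amortize per-job overhead.
	if batchSize < minBatchSize {
		batchSize = minBatchSize
	}
	// Round down to a checkpoint-aligned multiple of 64 ledgers.
	return batchSize / historyCheckpointLedgerInterval * historyCheckpointLedgerInterval
}

func main() {
	fmt.Println(calculateParallelLedgerBatchSize(100000, 4)) // 24960
	fmt.Println(calculateParallelLedgerBatchSize(100, 4))    // 64
}
```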

This is a breaking change, so it would be good to include it in the upcoming protocol 22 release.


Original Description (Keeping it for context):
TL;DR
For reingestion using the BufferedStorageBackend, the ParallelJobSize
parameter is currently set to a default of 100. This creates a conflict with the buffer_size parameter in the BufferedStorageBackend configuration. To fix this, we should either remove the ParallelJobSize parameter for reingestion using BufferedStorageBackend or set it to a larger value by default.

More info:
The ParallelJobSize parameter determines how many ledgers are processed in each task during parallel ingestion. It is set to 100k by default for Captive Core ingestion. This number was determined based on how many ledgers Captive Core can efficiently replay at once. However, when using the BufferedStorageBackend for reingestion, ParallelJobSize doesn’t really matter because the BufferedStorageBackend is stateless and isn’t affected by the number of ledgers assigned to it.

The BufferedStorageBackend has a buffer_size config parameter, which determines how many files are kept in memory. Setting ParallelJobSize to 100 limits the usefulness of buffer_size: for example, even if buffer_size is set to 1000, each task will still only process 100 ledgers due to the ParallelJobSize setting.

While ParallelJobSize doesn’t directly impact performance with the BufferedStorageBackend, it could be helpful to set it to a larger value, like 100k, for more efficient reingestion of large ledger ranges. When reingesting a larger ledger range, the older ledgers are smaller and faster to process, which means some workers might finish their tasks faster than others. By dividing the workload into smaller tasks, those faster workers can take on additional work, resulting in better efficiency.
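The load-balancing point above can be sketched as follows: splitting a range into fixed-size jobs produces a work queue, so workers that finish early can pick up the next job. The function name and numbers are illustrative only, not horizon code.

```go
package main

import "fmt"

// splitRange divides an inclusive ledger range [from, to] into jobs of at
// most jobSize ledgers each. Workers that finish a job early can take the
// next one from the resulting list.
func splitRange(from, to, jobSize uint32) [][2]uint32 {
	var jobs [][2]uint32
	for start := from; start <= to; start += jobSize {
		end := start + jobSize - 1
		if end > to {
			end = to
		}
		jobs = append(jobs, [2]uint32{start, end})
	}
	return jobs
}

func main() {
	// A 250-ledger range split into 100-ledger jobs yields 3 jobs.
	fmt.Println(splitRange(1, 250, 100)) // [[1 100] [101 200] [201 250]]
}
```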

Contributor

tamirms commented Sep 20, 2024

@urvisavla does the buffer_size config parameter have a default value?

You mentioned:

To fix this, we should either remove the ParallelJobSize parameter for reingestion using BufferedStorageBackend or set it to a larger value by default.

What about the option of overriding buffer_size so that it is equal to ParallelJobSize?

Contributor Author

urvisavla commented Sep 22, 2024

@urvisavla does the buffer_size config parameter have a default value?

Not currently but Shawn's PR adds default values based on the datastore schema #5462

You mentioned:

To fix this, we should either remove the ParallelJobSize parameter for reingestion using BufferedStorageBackend or set it to a larger value by default.

What about the option of overriding buffer_size so that it is equal to ParallelJobSize?

Currently, if buffer_size is set above 100, it is effectively overridden by ParallelJobSize (=100) because each task will process a range of 100 (ParallelJobSize) ledgers. The Ledger Buffer never queues ledgers past the end of the specified range.

Explaining the interaction between ParallelJobSize and buffer_size will be challenging and most of the time reingestion performance using BufferedStorageBackend will depend on setting the right buffer_size.

The best solution, imo, would be to deprecate ParallelJobSize entirely. I doubt many users know how to use it effectively even for Captive Core. We can set ParallelJobSize to 100,000 by default for both Captive Core and BufferedStorageBackend reingestion. This way, users only need to focus on tuning two things: 1) buffer_size and num_workers (download) based on their data lake schema, and 2) the number of parallel reingestion workers depending on their hardware; beefier machines can handle more parallel reingestion.

@sreuland @mollykarcher

@urvisavla urvisavla moved this from Backlog to To Do in Platform Scrum Sep 24, 2024
Contributor

tamirms commented Sep 24, 2024

Currently, if buffer_size is set above 100, it is effectively overridden by ParallelJobSize (=100) because each task will process a range of 100 (ParallelJobSize) ledgers. The Ledger Buffer never queues ledgers past the end of the specified range.

ok, I understand now, thanks for the explanation!

We can set ParallelJobSize to 100,000 by default for both Captive Core and BufferedStorageBackend reingestion.

I looked through the code and it seems difficult to determine the effect of the ParallelJobSize parameter because it turns out the parameter is just a suggestion. We also have a MaxLedgerPerFlush configuration parameter which further complicates the picture. So, I think we need to rethink ParallelJobSize and try to simplify how the value is derived.

But I don't think the buffer_size in the BufferedStorageBackend toml file should influence the batch size of each reingestion worker. IMO it makes more sense if the buffer_size in the BufferedStorageBackend config is set to the minimum of the value present in the toml file and the batch size of each reingestion worker.
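The explicit minimum proposed above could be applied when the backend config is built. This is a sketch under assumed names: the struct and function below are hypothetical, not the actual horizon config types.

```go
package main

import "fmt"

// bufferedStorageConfig is a hypothetical stand-in for the
// BufferedStorageBackend toml config; the real struct has more fields
// and may use different names.
type bufferedStorageConfig struct {
	BufferSize uint32
	NumWorkers uint32
}

// resolveBufferSize makes the implicit cap explicit: the effective
// buffer_size is the minimum of the toml value and the per-worker
// reingestion batch size.
func resolveBufferSize(cfg bufferedStorageConfig, batchSize uint32) uint32 {
	if cfg.BufferSize > batchSize {
		return batchSize
	}
	return cfg.BufferSize
}

func main() {
	cfg := bufferedStorageConfig{BufferSize: 1000, NumWorkers: 10}
	// With a batch size of 100 per worker, a buffer_size of 1000 is
	// capped at 100.
	fmt.Println(resolveBufferSize(cfg, 100))
}
```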

@tamirms tamirms added this to the platform sprint 51 milestone Sep 24, 2024
@urvisavla
Contributor Author

I looked through the code and it seems difficult to determine the effect of the ParallelJobSize parameter because it turns out the parameter is just a suggestion. We also have a MaxLedgerPerFlush configuration parameter which further complicates the picture. So, I think we need to rethink ParallelJobSize and try to simplify how the value is derived.

Yes, I was suggesting that the parallel job size should not be user configurable; it's something we can set to a fixed value or as you mentioned calculate internally.

But, I don't think the buffer_size in the BufferedStorageBackend toml file should influence the batch size of each reingestion worker.

I'm not suggesting that the buffer_size should influence the batch size, but that the reverse is true.

IMO it makes more sense if the buffer_size in the BufferedStorageBackend config is set to the minimum of the value present in the toml file and the batch size of each reingestion worker.

Agreed, this is essentially the implicit effect, but we can make it explicit.

Contributor

tamirms commented Sep 25, 2024

Yes, I was suggesting that the parallel job size should not be user configurable; it's something we can set to a fixed value or as you mentioned calculate internally.

sounds good! MaxLedgerPerFlush will ensure that ingestion does not consume too much memory, so it should be safe to derive ParallelJobSize from the size of the total ledger range and the number of ingestion workers. I am on board with your plan to deprecate the ParallelJobSize parameter.

@sreuland
Contributor

+1 to deprecate the --parallel-job-size cli param in favor of internal optimization, less external config, bonus!

@urvisavla urvisavla moved this from Backlog to To Do in Platform Scrum Oct 1, 2024
@urvisavla urvisavla changed the title services/horizon: adjust ParallelJobSize parameter when reingesting from BufferedStorageBackend (datastore) services/horizon: Remove --paralleljobsize config parameter Oct 1, 2024
@urvisavla
Contributor Author

Thanks @tamirms @sreuland. I've updated the description and changed the scope from a bug to a feature.

@urvisavla urvisavla self-assigned this Oct 1, 2024
@urvisavla urvisavla changed the title services/horizon: Remove --paralleljobsize config parameter services/horizon: Remove --parallel-job-size config parameter Oct 3, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Platform Scrum Oct 14, 2024