services/horizon: Remove --parallel-job-size config parameter #5468
@urvisavla does the buffer_size config parameter have a default value? You mentioned:
What about the option of overriding buffer_size so that it is equal to …
Not currently but Shawn's PR adds default values based on the datastore schema #5462
Currently, if buffer_size is set above 100, it is effectively overridden by ParallelJobSize (=100) because each task will process a range of 100 (ParallelJobSize) ledgers. The ledger buffer never queues ledgers past the end of the specified range. Explaining the interaction between ParallelJobSize and buffer_size will be challenging, and most of the time reingestion performance using BufferedStorageBackend will depend on setting the right buffer_size. The best solution, imo, would be to deprecate ParallelJobSize entirely. I doubt many users know how to use it effectively even for Captive Core. We can set ParallelJobSize to 100,000 by default for both Captive Core and BufferedStorageBackend reingestion. This way, users only need to focus on tuning two things: 1) buffer_size and num_workers (download) based on their data lake schema, and 2) the number of parallel reingestion workers depending on their hardware (beefier machines can handle more parallel reingestion).
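To make the capping concrete, here is a minimal Go sketch (hypothetical names, not the actual Horizon code) of why a buffer_size above the parallel job size is wasted: the buffer can never hold more ledgers than the task's range contains.

```go
package main

import "fmt"

// effectiveBufferSize is illustrative only: the ledger buffer never queues
// ledgers past the end of the range assigned to a task, so the number of
// ledgers it can hold in flight is capped by the task's range size.
func effectiveBufferSize(bufferSize, parallelJobSize uint32) uint32 {
	if bufferSize > parallelJobSize {
		return parallelJobSize
	}
	return bufferSize
}

func main() {
	// buffer_size = 1000 in the toml, but ParallelJobSize = 100:
	// only 100 ledgers are ever buffered per task.
	fmt.Println(effectiveBufferSize(1000, 100)) // prints 100
}
```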
ok, I understand now, thanks for the explanation!
I looked through the code and it seems difficult to determine the effect of the ParallelJobSize parameter because it turns out the parameter is just a suggestion. We also have a MaxLedgerPerFlush configuration parameter which further complicates the picture. So, I think we need to rethink ParallelJobSize and try to simplify how the value is derived. But I don't think the buffer_size in the BufferedStorageBackend toml file should influence the batch size of each reingestion worker. IMO it makes more sense if the buffer_size in the BufferedStorageBackend config is set to the minimum between the value present in the toml file and the batch size of each reingestion worker.
Yes, I was suggesting that the parallel job size should not be user configurable; it's something we can set to a fixed value or as you mentioned calculate internally.
I'm not suggesting that the buffer_size should influence the batch size but that the reverse is true.
Agreed, this is essentially the implicit effect but we can make it explicit.
Sounds good! MaxLedgerPerFlush will ensure that reingestion does not consume too much memory, so it should be safe to derive ParallelJobSize from the size of the total ledger range and the number of ingestion workers. I am on board with your plan to deprecate the ParallelJobSize parameter.
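A minimal sketch of that derivation under the stated assumptions (the helper name deriveJobSize is hypothetical; the real implementation may clamp or round differently):

```go
package main

import "fmt"

// deriveJobSize is a hypothetical replacement for the user-facing
// ParallelJobSize flag: split the total range evenly across the parallel
// reingestion workers, using ceiling division so every ledger is covered.
// MaxLedgerPerFlush separately bounds memory during ingestion itself.
func deriveJobSize(from, to, workers uint32) uint32 {
	total := to - from + 1
	return (total + workers - 1) / workers
}

func main() {
	// Reingest ledgers 1..1,000,000 with 8 workers: each worker gets
	// a contiguous batch of 125,000 ledgers.
	fmt.Println(deriveJobSize(1, 1_000_000, 8)) // prints 125000
}
```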
+1 to deprecate the value set at go/services/horizon/internal/ingest/parallel.go, line 116 in f668d5b.
This is a breaking change so it would be good to include it in the upcoming protocol 22 release.
Original Description (Keeping it for context):
TL;DR
For reingestion using the BufferedStorageBackend, the ParallelJobSize parameter is currently set to a default of 100. This conflicts with the buffer_size parameter in the BufferedStorageBackend configuration. To fix this, we should either remove the ParallelJobSize parameter for reingestion using the BufferedStorageBackend or set it to a larger value by default.
More info:
The ParallelJobSize parameter determines how many ledgers are processed in each task during parallel ingestion. It is set to 100k by default for Captive Core ingestion. This number was determined based on how many ledgers Captive Core can efficiently replay at once. However, when using the BufferedStorageBackend for reingestion, ParallelJobSize doesn’t really matter because the BufferedStorageBackend is stateless and isn’t affected by the number of ledgers assigned to it.
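For illustration, a simplified sketch of what ParallelJobSize controls (a hypothetical splitRange helper, not the actual parallel.go code): the reingestion range is cut into consecutive subranges of at most ParallelJobSize ledgers, each handed to a worker as one task.

```go
package main

import "fmt"

type ledgerRange struct{ from, to uint32 }

// splitRange divides [from, to] into consecutive subranges of at most
// jobSize ledgers; each subrange becomes one task for a worker.
func splitRange(from, to, jobSize uint32) []ledgerRange {
	var tasks []ledgerRange
	for start := from; start <= to; start += jobSize {
		end := start + jobSize - 1
		if end > to {
			end = to
		}
		tasks = append(tasks, ledgerRange{start, end})
	}
	return tasks
}

func main() {
	// With ParallelJobSize = 100, reingesting 1..250 yields three tasks.
	fmt.Println(splitRange(1, 250, 100)) // [{1 100} {101 200} {201 250}]
}
```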
The BufferedStorageBackend has a buffer_size config parameter which determines how many files are kept in memory. By setting ParallelJobSize to 100, we limit the usefulness of buffer_size. For example, even if buffer_size is set to 1000, each task will still only process 100 ledgers due to the ParallelJobSize setting.
While ParallelJobSize doesn’t directly impact performance with the BufferedStorageBackend, it could be helpful to set it to a larger value like 100k for more efficient reingestion of large ledger ranges. When reingesting a larger ledger range, the older ledgers are smaller and faster to process, which means some workers might finish their tasks faster than others. By dividing the workload into smaller tasks, those faster workers can take on additional work, resulting in better efficiency.
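That load-balancing effect is the standard worker-pool pattern; below is a sketch (not Horizon's actual scheduler, and reingest is a stand-in stub) where workers pull tasks from a shared channel, so whichever worker finishes first picks up the next subrange.

```go
package main

import (
	"fmt"
	"sync"
)

type ledgerRange struct{ from, to uint32 }

// reingest stands in for replaying one subrange of ledgers; older,
// smaller ledgers finish faster than recent ones.
func reingest(r ledgerRange) {
	fmt.Printf("reingested ledgers %d..%d\n", r.from, r.to)
}

func main() {
	tasks := []ledgerRange{{1, 100}, {101, 200}, {201, 300}, {301, 400}}
	jobs := make(chan ledgerRange)
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two workers share four small tasks
		wg.Add(1)
		go func() {
			defer wg.Done()
			// A worker pulls the next task as soon as it finishes its
			// current one, so faster workers absorb more of the range.
			for r := range jobs {
				reingest(r)
			}
		}()
	}
	for _, r := range tasks {
		jobs <- r
	}
	close(jobs)
	wg.Wait()
}
```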