Pending job batch queue for compaction job creation #3678
Labels: enhancement, parent-issue
Background
Split from:
Description
We'd like to parallelise compaction job creation by adding a separate step to actually send the jobs. The compaction job creation lambda can just decide which compaction jobs should be run, group them into large batches, and send each batch to a pending job batch queue. A separate lambda can then send the jobs, with multiple instances handling batches in parallel.
This should avoid requiring the compaction job creation lambda to make a large number of API calls to both create and send all the compaction jobs.
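As a rough sketch, the message on the pending job batch queue might carry just enough to locate the batch and track retries. The class and field names below are illustrative, not existing Sleeper code:

```java
/**
 * Minimal sketch of a pending job batch message. All names here are
 * hypothetical; the real message would be defined with the new lambda.
 */
public record PendingJobBatchMessage(
        String tableId,         // Sleeper table the jobs belong to
        String batchKey,        // S3 object key holding the serialised compaction jobs
        int retryCount,         // how many times the batch has been requeued
        boolean statusRecorded  // whether the "created" job status has already been written
) {
}
```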
Analysis
The code in `CreateCompactionJobs.batchCreateJobs` is what we want to move to another lambda, as it takes some time to execute across all the batches. For each batch, we can write the jobs to S3, then put a message on a pending jobs queue pointing to the batch. This would allow multiple lambda instances to receive those batches and create the jobs. This should let us make the batch size much bigger (`sleeper.table.compaction.job.send.batch.size`), and create a much larger number of compaction jobs in a single invocation (`sleeper.compaction.job.creation.limit`).

For each batch, as we add it to the pending jobs queue we can also submit a file assignment commit for the whole batch to the state store committer. The lambda that receives the batch can check whether the file assignment has been applied, and if not it can put the batch back on the queue with a delay. A batch that has been retried enough times without the file assignment being applied can go to the dead letter queue.
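A rough sketch of how the pending jobs lambda could handle a batch, reusing the hypothetical `PendingJobBatchMessage` above. The checker, sender and queue interfaces stand in for whatever Sleeper components would actually do the work, and the retry limit and delay are illustrative:

```java
/** Sketch of the pending jobs lambda flow, under the assumptions described above. */
public class PendingBatchHandler {

    // Hypothetical collaborators; Sleeper's real classes will differ.
    public interface FileAssignmentChecker {
        boolean isAssignmentApplied(PendingJobBatchMessage batch);
    }

    public interface BatchJobSender {
        void sendJobs(PendingJobBatchMessage batch); // reads the jobs from S3 and sends them
    }

    public interface PendingBatchQueue {
        void requeueWithDelay(PendingJobBatchMessage batch, int delaySeconds);
        void sendToDeadLetterQueue(PendingJobBatchMessage batch);
    }

    private static final int MAX_RETRIES = 5;        // illustrative limit
    private static final int RETRY_DELAY_SECONDS = 30; // illustrative delay

    private final FileAssignmentChecker assignmentChecker;
    private final BatchJobSender jobSender;
    private final PendingBatchQueue queue;

    public PendingBatchHandler(FileAssignmentChecker assignmentChecker,
                               BatchJobSender jobSender,
                               PendingBatchQueue queue) {
        this.assignmentChecker = assignmentChecker;
        this.jobSender = jobSender;
        this.queue = queue;
    }

    public void handle(PendingJobBatchMessage batch) {
        if (assignmentChecker.isAssignmentApplied(batch)) {
            // Input files are assigned to the jobs, so the batch can be sent.
            jobSender.sendJobs(batch);
        } else if (batch.retryCount() < MAX_RETRIES) {
            // Assignment not applied yet; put the batch back with a delay so the
            // state store committer has time to catch up.
            queue.requeueWithDelay(incrementRetry(batch), RETRY_DELAY_SECONDS);
        } else {
            // Assignment still not applied after enough retries; give up on the batch.
            queue.sendToDeadLetterQueue(batch);
        }
    }

    private PendingJobBatchMessage incrementRetry(PendingJobBatchMessage batch) {
        return new PendingJobBatchMessage(batch.tableId(), batch.batchKey(),
                batch.retryCount() + 1, batch.statusRecorded());
    }
}
```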
We can avoid status store updates in the compaction job creation lambda and do them in the pending jobs lambda instead. We could still apply the created status there, even when we're putting a batch back on the queue, and record in the message that the status has already been applied so it isn't written again on a later retry.
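For example, the pending jobs lambda could write the created status the first time it sees a batch and carry a flag forward on requeue so the status isn't written twice. The status store interface and method here are assumptions, not Sleeper's real API:

```java
/** Sketch of writing the "created" job status at most once per batch. */
public class CreatedStatusRecorder {

    // Hypothetical status store interface; the real one would come from Sleeper.
    public interface JobStatusStore {
        void jobsCreated(PendingJobBatchMessage batch);
    }

    private final JobStatusStore statusStore;

    public CreatedStatusRecorder(JobStatusStore statusStore) {
        this.statusStore = statusStore;
    }

    /** Returns the message to requeue, with the flag set once the status is written. */
    public PendingJobBatchMessage recordCreatedStatusIfNeeded(PendingJobBatchMessage batch) {
        if (batch.statusRecorded()) {
            return batch; // status was already written on an earlier receive
        }
        statusStore.jobsCreated(batch);
        return new PendingJobBatchMessage(batch.tableId(), batch.batchKey(),
                batch.retryCount(), true);
    }
}
```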