Skip to content
This repository has been archived by the owner on Mar 16, 2021. It is now read-only.

V3: extract parallelized commit processing into common utility class #401

Merged
merged 3 commits into from
Nov 14, 2018

Conversation

dtivel
Copy link
Contributor

@dtivel dtivel commented Nov 13, 2018

Progress on https://github.com/NuGet/Engineering/issues/1843.

At a high level this change aims to do 3 things:

  • extract existing parallelized commit processing functionality into a common utility class so it can be leveraged by Catalog2Dnx and Joel's work on search
  • improve the implementation
  • add tests for both new and existing code

Here is the general approach for parallelizing catalog commits in jobs like Catalog2Registration and Catalog2Dnx:

Where P represents a package ID and T a commit timestamp, suppose we have the following catalog commits for fixed timestamp range T₀-T₃:

          P₀   P₁   P₂   P₃
   ^      ------------------
   |      T₃   T₃
   |                     T₂
   |      T₁        T₁
  time    T₀   T₀

Each column contains all catalog commits for its corresponding package ID (not identity).

Each column is represented by a CatalogCommitBatchTask instance. Each group of columns sharing the same minimum commit timestamp is represented by a CatalogCommitBatchTasks instance. In the above example, there will be 3 CatalogCommitBatchTasks instances:

  1. an instance for P₀ and P₁, since they share the same minimum commit timestamp T₀
  2. an instance for P₂
  3. an instance for P₃

RegistrationCollector (and presumably DnxCatalogCollector in the future) processes maxConcurrentBatches columns in parallel. The default is 10.

Only when all processing for a CatalogCommitBatchTasks instance completes successfully will a CommitCollector update its cursor. In the above example, the cursor will be updated as follows:

  1. When both P₀-T₀ and P₁-T₀ succeed, update the cursor to T₀.
  2. When P₂-T₁ succeeds, update the cursor to T₁. Successful completion of P₀-T₁ is implied by completion of the previous step.
  3. When P₃-T₂ succeeds, update the cursor to T₃. Successful completion of P₀-T₁ and P₂-T₁ is implied by completion of the previous step. Note that since this is the last work item for the given time range, the cursor is not updated to T₂ but instead the maximum commit timestamp for the entire commit timestamp range.

This PR adds 2 improvements over the existing implementation:

  1. Because updating the cursor is costly, updating the cursor to the latest successfully committed timestamp instead of the current commit timestamp. The above example already demonstrates this in step 3 by updating the cursor to T₃ instead of first T₂ and then T₃.
  2. Only the most recent catalog commit per package identity needs to be processed; older commits are skipped. In the above example, P₀-T₀, P₀-T₁, and P₁-T₀ would be skipped.

Copy link
Member

@joelverhagen joelverhagen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@joelverhagen
Copy link
Member

Nice one 🥇. It is perhaps worth noting to posterity that the aim of this PR is to maintain as many assumptions as possible that are guaranteed by our current implementation.

For example, the part about tracking the minimum commit timestamp of a batch before filtering out non-latest leaves per package identity... this is not strictly necessary to maintain contract of any of our V3 resources but is essentially done by today's implementation of, say, catalog2registration.

src/Catalog/CatalogCommit.cs Outdated Show resolved Hide resolved
@dtivel dtivel merged commit b87f57b into dev Nov 14, 2018
@dtivel dtivel deleted the dev-dtivel-parallelizecommits branch November 14, 2018 00:03
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants