Implement sweeper data versioning for repairkit #73
🗒️ Summary
Implements a mechanism (in repairkit, but the approach is easily used elsewhere) to avoid redundant processing work. Previously, just no-op iterating through the initial query for repairkit was taking an inordinate amount of time. Now, only documents which haven't been updated with a version of repairkit GTE the current version are returned by the initial query.
For any sweeper, the sweeper version should be written as an integer to `f"ops:Provenance/ops:registry_sweepers_{sweeper_name}_version"`. This version should be updated in the sweeper's `constants` submodule whenever a change is made to the sweeper which invalidates previous processing of documents. Because a version bump can only result from a code change in the first place, I've elected to hard-code the version in `constants` instead of using a configuration file, for simplicity.

Timeouts have been bumped to improve stability when run against prod registries. I've hardcoded them initially, but if they need to be changed a second time, I'll move them to a CLI argument.
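
As a rough illustration of the versioning scheme described in this summary, here is a minimal sketch of how the metadata key and the initial query might be built. The function names, the `REPAIRKIT_VERSION` value, and the exact query shape are assumptions for illustration only and are not taken from the PR's code; the query uses standard OpenSearch/Elasticsearch `bool`/`range` clauses.

```python
# Hypothetical value; in the PR the real version lives in the sweeper's
# `constants` submodule and is bumped by hand on invalidating changes.
REPAIRKIT_VERSION = 2


def version_metadata_key(sweeper_name: str) -> str:
    """Return the metadata key under which a sweeper records its version."""
    return f"ops:Provenance/ops:registry_sweepers_{sweeper_name}_version"


def unprocessed_docs_query(sweeper_name: str, current_version: int) -> dict:
    """Build a query matching documents not yet processed at current_version.

    A document matches if its version field is missing or lower than
    current_version, i.e. it has NOT been written with a version >= current.
    """
    key = version_metadata_key(sweeper_name)
    return {
        "query": {
            "bool": {
                "must_not": [
                    {"range": {key: {"gte": current_version}}},
                ]
            }
        }
    }


def version_update(sweeper_name: str, current_version: int) -> dict:
    """Build the partial-document update that marks a document as processed."""
    return {version_metadata_key(sweeper_name): current_version}
```

With this shape, a second run at the same version matches no documents, which is consistent with the near-zero runtime reported for subsequent runs.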
There are additional unrelated changes/bugfixes included in this PR.
⚙️ Test Data and/or Report
Manually tested against local, then en-prod (repairkit took 53 min on the initial run and 0 sec on subsequent runs).
♻️ Related Issues
related to #61
fixes #70