feat(router): avoid worker starvation during job pickup #2379
@@ -704,7 +705,6 @@ func loadConfig() {
config.RegisterDurationConfigVariable(5, &refreshDSListLoopSleepDuration, true, time.Second, []string{"JobsDB.refreshDSListLoopSleepDuration", "JobsDB.refreshDSListLoopSleepDurationInS"}...)
config.RegisterDurationConfigVariable(5, &backupCheckSleepDuration, true, time.Second, []string{"JobsDB.backupCheckSleepDuration", "JobsDB.backupCheckSleepDurationIns"}...)
config.RegisterDurationConfigVariable(5, &cacheExpiration, true, time.Minute, []string{"JobsDB.cacheExpiration"}...)
useJoinForUnprocessed = config.GetBool("JobsDB.useJoinForUnprocessed", true)
Explain: removing this config option since we are always using a join for unprocessed
@@ -17,14 +17,15 @@ type Options struct {
}

// LoadOptions loads application's initialisation options based on command line flags and environment
func LoadOptions() *Options {
func LoadOptions(args []string) *Options {
Explain: adding support for calling `Run` multiple times within the same test
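A minimal sketch of why accepting `args` helps here, assuming the options are parsed with the standard `flag` package. The trimmed-down `Options` fields, the flag names and the extra `error` return value are hypothetical and not taken from this PR; the point is only that a private `flag.FlagSet` can be parsed repeatedly, whereas registering the same flag twice on the global `flag.CommandLine` panics.

```go
package main

import (
	"flag"
	"fmt"
)

// Options is a hypothetical, trimmed-down stand-in for the real options struct.
type Options struct {
	NormalMode bool
	ClearDB    bool
}

// LoadOptions parses args with its own FlagSet, so it can be called once per
// test run without touching (or re-registering) any global flags.
// Unlike the signature in the diff, this sketch also returns the parse error.
func LoadOptions(args []string) (*Options, error) {
	fs := flag.NewFlagSet("app", flag.ContinueOnError)
	opts := &Options{}
	// Hypothetical flag names, for illustration only.
	fs.BoolVar(&opts.NormalMode, "normal-mode", false, "run in normal mode")
	fs.BoolVar(&opts.ClearDB, "cleardb", false, "clear the database on start")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return opts, nil
}

func main() {
	// A real binary would typically pass os.Args[1:] here.
	first, _ := LoadOptions([]string{"-normal-mode"})
	second, _ := LoadOptions([]string{"-cleardb"})
	fmt.Println(first.NormalMode, first.ClearDB)
	fmt.Println(second.NormalMode, second.ClearDB)
}
```

Each call builds a fresh `FlagSet`, so two consecutive calls with different arguments do not interfere with each other.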
type MultiTenantLegacy struct {
*HandleT
}

func (mj *MultiTenantLegacy) GetAllJobs(ctx context.Context, workspaceCount map[string]int, params GetQueryParamsT, _ int) ([]*JobT, error) { // skipcq: CRT-P0003
type legacyMoreToken struct {
Explain: for legacy query we need to keep track of each subQuery's latest job ID, thus we are using a different type of MoreToken
type (
	MoreToken interface{}
	moreToken struct {
		afterJobIDs map[string]*int64
Explain: for fair pickup algorithm we need to keep track of all the latest job IDs per workspace
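As a rough illustration of the idea, the sketch below keeps one cursor per workspace and advances it from each batch of returned jobs. Only the `moreToken` shape comes from the diff; the `job` type and the `advance` and `afterJobID` helpers are hypothetical names introduced for this example.

```go
package main

import "fmt"

// job is a hypothetical, minimal stand-in for a jobsdb job.
type job struct {
	JobID       int64
	WorkspaceID string
}

// moreToken mirrors the shape shown in the diff: the latest job ID per workspace.
type moreToken struct {
	afterJobIDs map[string]*int64
}

// advance records, for every workspace in the batch, the highest job ID seen
// so far, so that a follow-up query can resume after it.
func (t *moreToken) advance(jobs []job) {
	if t.afterJobIDs == nil {
		t.afterJobIDs = make(map[string]*int64)
	}
	for _, j := range jobs {
		id := j.JobID // fresh variable per iteration, safe to take its address
		if cur, ok := t.afterJobIDs[j.WorkspaceID]; !ok || cur == nil || *cur < id {
			t.afterJobIDs[j.WorkspaceID] = &id
		}
	}
}

// afterJobID returns the cursor for a workspace, or nil if none was recorded.
func (t *moreToken) afterJobID(workspace string) *int64 {
	return t.afterJobIDs[workspace]
}

func main() {
	token := &moreToken{}
	token.advance([]job{
		{JobID: 3, WorkspaceID: "ws-1"},
		{JobID: 7, WorkspaceID: "ws-1"},
		{JobID: 5, WorkspaceID: "ws-2"},
	})
	for _, ws := range []string{"ws-1", "ws-2", "ws-3"} {
		if id := token.afterJobID(ws); id != nil {
			fmt.Printf("%s: resume after job %d\n", ws, *id)
		} else {
			fmt.Printf("%s: no cursor yet\n", ws)
		}
	}
}
```

A nil pointer distinguishes "no cursor yet" from a cursor at job ID 0, which is presumably why the map stores `*int64` rather than `int64`.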
jd.markClearEmptyResult(ds, allWorkspaces, stateFilters, customValFilters, parameterFilters, willTryToSet, nil)

var stateQuery, customValQuery, limitQuery, sourceQuery string
skipCacheResult := params.AfterJobID != nil
Explain: we don't update the cache if the query contains an `AfterJobID` parameter
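One way to read this rule, sketched below: a query that starts after a given job ID only sees part of a dataset, so an empty result from it says nothing about the dataset as a whole and should not be cached. That rationale is an assumption on my part, and `queryParams`, `emptyResultCache`, `record` and `canSkip` are hypothetical names, not the jobsdb cache API.

```go
package main

import "fmt"

// queryParams is a hypothetical subset of the real query parameters.
type queryParams struct {
	AfterJobID *int64 // nil means "query the dataset from the beginning"
}

// emptyResultCache remembers datasets that are known to hold no matching jobs.
type emptyResultCache struct {
	noJobs map[string]bool
}

func newEmptyResultCache() *emptyResultCache {
	return &emptyResultCache{noJobs: make(map[string]bool)}
}

// record stores the outcome of a query, unless the query used a cursor:
// a partial scan must not overwrite what is known about the full dataset.
func (c *emptyResultCache) record(dataset string, params queryParams, jobsFound int) {
	skipCacheResult := params.AfterJobID != nil
	if skipCacheResult {
		return
	}
	c.noJobs[dataset] = jobsFound == 0
}

// canSkip reports whether the dataset is known to have no matching jobs.
func (c *emptyResultCache) canSkip(dataset string) bool {
	return c.noJobs[dataset]
}

func main() {
	cache := newEmptyResultCache()
	after := int64(100)

	// Empty result from a cursor query: nothing is cached.
	cache.record("ds_1", queryParams{AfterJobID: &after}, 0)
	fmt.Println(cache.canSkip("ds_1")) // prints false

	// Empty result from a full query: the dataset can be skipped next time.
	cache.record("ds_1", queryParams{}, 0)
	fmt.Println(cache.canSkip("ds_1")) // prints true
}
```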
Codecov Report
Base: 43.15% // Head: 44.63% // Increases project coverage by +1.47%
Additional details and impacted files
@@ Coverage Diff @@
## master #2379 +/- ##
==========================================
+ Coverage 43.15% 44.63% +1.47%
==========================================
Files 186 188 +2
Lines 40020 39148 -872
==========================================
+ Hits 17272 17475 +203
+ Misses 21642 20581 -1061
+ Partials 1106 1092 -14
func main() {
	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	exitCode := Run(ctx)
	r := runner.New(runner.ReleaseInfo{
Explain: moved running logic away from `main`, to the `runner` package, so that we can move integration test files to the packages that they belong to.
LGTM
This behaviour is intentional. The iterator only compensates for discarded jobs to avoid ending up with an unproductive pickup loop.
After an iterator completes, the next pickup loop will create a fresh iterator with fresh limits. The time between two pickup loops should remain small enough and shouldn't cause any significant delays in picking up newly arrived jobs.
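The behaviour described in this reply could look roughly like the sketch below. It is an illustration under stated assumptions, not the `jobiterator` implementation: the `iterator` type, `fetchFunc`, `pickup` and the exact stopping rule (re-query only for the number of jobs discarded since the previous query, and only while the discarded share exceeds the tolerance) are hypothetical. The two limits correspond to `Router.jobIterator.maxQueries` and `Router.jobIterator.discardedPercentageTolerance` from the description.

```go
package main

import "fmt"

type job struct{ ID int64 }

// fetchFunc is a hypothetical stand-in for a jobsDB query:
// it returns up to limit jobs with an ID greater than afterID.
type fetchFunc func(afterID int64, limit int) []job

type iterator struct {
	fetch                        fetchFunc
	limit                        int     // size of the first query
	maxQueries                   int     // 1 disables follow-up queries
	discardedPercentageTolerance float64 // e.g. 10 means 10%

	queries, requested, discarded int
	lastID                        int64
	buffer                        []job
}

// HasNext refills the buffer when it runs dry, but only to compensate for
// jobs discarded since the previous query.
func (it *iterator) HasNext() bool {
	if len(it.buffer) > 0 {
		return true
	}
	if it.queries >= it.maxQueries {
		return false
	}
	next := it.limit
	if it.queries > 0 {
		if it.discarded == 0 {
			return false
		}
		pct := 100 * float64(it.discarded) / float64(it.requested)
		if pct <= it.discardedPercentageTolerance {
			return false
		}
		next = it.discarded
	}
	it.queries++
	it.requested, it.discarded = next, 0
	it.buffer = it.fetch(it.lastID, next)
	return len(it.buffer) > 0
}

// Next pops the next job; call it only after HasNext returned true.
func (it *iterator) Next() job {
	j := it.buffer[0]
	it.buffer = it.buffer[1:]
	it.lastID = j.ID
	return j
}

// Discard tells the iterator that the last job could not be used.
func (it *iterator) Discard() { it.discarded++ }

// pickup drains the iterator, keeping only the jobs accepted by keep.
func pickup(it *iterator, keep func(job) bool) []job {
	var accepted []job
	for it.HasNext() {
		if j := it.Next(); keep(j) {
			accepted = append(accepted, j)
		} else {
			it.Discard()
		}
	}
	return accepted
}

// inMemoryFetch serves jobs with IDs 1..total, for demonstration only.
func inMemoryFetch(total int64) fetchFunc {
	return func(afterID int64, limit int) []job {
		var out []job
		for id := afterID + 1; id <= total && len(out) < limit; id++ {
			out = append(out, job{ID: id})
		}
		return out
	}
}

func main() {
	it := &iterator{fetch: inMemoryFetch(30), limit: 4, maxQueries: 3, discardedPercentageTolerance: 10}
	odd := func(j job) bool { return j.ID%2 == 1 } // pretend even IDs are throttled
	fmt.Println(len(pickup(it, odd)), it.queries)
}
```

Because every follow-up query asks only for as many jobs as were just discarded, a single pickup loop never fetches more than it can use, and the `maxQueries` cap bounds how long one loop can keep querying.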
One other thought is that since we query by
to check whether or not to query a DS (except the last DS).
Yes this would be an option, however, since Thus I would argue that we can start without this special optimisation branch in our codebase.
We'd also want to have a memory limit over how the
Since discarded jobs are eligible for garbage collection, we can rely on them being collected if memory resources become scarce.
Awesome work! 🎉
Description
Introducing the `jobiterator` package in `router`, responsible for performing additional queries against jobsDB in order to fetch more jobs when some of the initially picked-up jobs get discarded (e.g. due to a job ordering barrier, throttling or backoff). The following (configurable) limitations apply:
- `Router.jobIterator.maxQueries` (default: 10). Setting this to 1 effectively disables the feature.
- `Router.jobIterator.discardedPercentageTolerance` (default: 10%)

Note: an additional limitation applies when `JobsDB.fairPickup` is enabled: a maximum number of `Router.maxDSQuery` datasets can be queried at any time (default: 10).

Additionally, the following changes/improvements have been introduced:
- `destinationID` in the job order key along with `userID`.
- `latenciesUsed` & `timeGained` from `GetRouterPickupJobs`.
- `JobsDB.useJoinForUnprocessed`.
- Moved running logic away from `main`, to the `runner` package, so that we can move integration test files to the packages that they belong to.

Notion Ticket
Link
Security
BEGIN_COMMIT_OVERRIDE
feat(router): avoid worker starvation during job pickup (#2379)
END_COMMIT_OVERRIDE