
feat(router): avoid worker starvation during job pickup #2379

Merged (2 commits, Nov 1, 2022)

Conversation

@atzoum (Contributor) commented Aug 31, 2022

Description

Introducing the jobiterator package in router, responsible for performing additional queries against jobsDB in order to fetch more jobs in case some of the initially picked-up jobs get discarded (e.g. due to the job ordering barrier, throttling or backoff). The following (configurable) limitations apply; a sketch of the resulting pickup loop is shown below the list:

  1. The iterator stops querying after Router.jobIterator.maxQueries queries (default: 10). Setting this to 1 effectively disables the feature.
  2. The iterator stops querying if the percentage of discarded jobs for the running query is less than or equal to Router.jobIterator.discardedPercentageTolerance (default: 10%).

Note: an additional limitation applies when JobsDB.fairPickup is enabled: a maximum of Router.maxDSQuery datasets can be queried at any time (default: 10).
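
For illustration, here is a minimal Go sketch of the compensating pickup loop these two limits produce. All names (Iterator, HasNext, Next, Discard, the fetch callback) are simplified stand-ins for the sketch, not the actual jobiterator API:

package main

import "fmt"

// Job stands in for jobsdb.JobT (simplified for the sketch).
type Job struct{ ID int64 }

// Iterator re-queries jobsDB only to compensate for discarded jobs and
// stops when the query budget is spent or discards fall within tolerance.
type Iterator struct {
	maxQueries int                    // Router.jobIterator.maxQueries
	tolerance  int                    // Router.jobIterator.discardedPercentageTolerance (%)
	fetch      func(limit int) []*Job // queries jobsDB for up to limit jobs
	limit      int                    // size of the next query
	queries    int
	buf        []*Job
	pos        int
	fetched    int
	discarded  int
}

func (it *Iterator) HasNext() bool {
	if it.pos < len(it.buf) {
		return true // still draining the last batch
	}
	if it.queries >= it.maxQueries {
		return false // limitation 1: query budget exhausted
	}
	if it.queries > 0 {
		if it.fetched == 0 || 100*it.discarded/it.fetched <= it.tolerance {
			return false // limitation 2: discards within tolerance
		}
		it.limit = it.discarded // only compensate for what was discarded
	}
	it.buf = it.fetch(it.limit)
	it.fetched, it.pos, it.discarded = len(it.buf), 0, 0
	it.queries++
	return len(it.buf) > 0
}

func (it *Iterator) Next() *Job { j := it.buf[it.pos]; it.pos++; return j }

// Discard marks a job as not handed to a worker (ordering barrier,
// throttling, backoff), making it eligible for a compensating fetch.
func (it *Iterator) Discard(*Job) { it.discarded++ }

func main() {
	nextID := int64(1)
	it := &Iterator{
		maxQueries: 10, tolerance: 10, limit: 5,
		fetch: func(limit int) []*Job {
			batch := make([]*Job, 0, limit)
			for i := 0; i < limit && nextID <= 12; i++ {
				batch = append(batch, &Job{ID: nextID})
				nextID++
			}
			return batch
		},
	}
	for it.HasNext() {
		job := it.Next()
		if job.ID%3 == 0 { // pretend every third job hits the ordering barrier
			it.Discard(job)
			continue
		}
		fmt.Println("assigned job", job.ID)
	}
}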

Additionally, the following changes/improvements have been introduced:

  1. Include destinationID in the job order key along with userID.
  2. Remove latenciesUsed & timeGained from GetRouterPickupJobs.
  3. Remove JobsDB.useJoinForUnprocessed.
  4. Move server startup logic away from package main to the runner package, so that we can move integration test files into the packages they belong to.

Notion Ticket

Link

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

BEGIN_COMMIT_OVERRIDE
feat(router): avoid worker starvation during job pickup (#2379)
END_COMMIT_OVERRIDE

@github-actions github-actions bot added the Stale label Sep 21, 2022
@atzoum atzoum removed the Stale label Sep 21, 2022
@github-actions github-actions bot added the Stale label Oct 12, 2022
@atzoum atzoum removed the Stale label Oct 12, 2022
@atzoum atzoum force-pushed the feat.routerDeepPickup branch 2 times, most recently from 5bc7efa to 9a47c30 on October 17, 2022 09:18
@@ -704,7 +705,6 @@ func loadConfig() {
 	config.RegisterDurationConfigVariable(5, &refreshDSListLoopSleepDuration, true, time.Second, []string{"JobsDB.refreshDSListLoopSleepDuration", "JobsDB.refreshDSListLoopSleepDurationInS"}...)
 	config.RegisterDurationConfigVariable(5, &backupCheckSleepDuration, true, time.Second, []string{"JobsDB.backupCheckSleepDuration", "JobsDB.backupCheckSleepDurationIns"}...)
 	config.RegisterDurationConfigVariable(5, &cacheExpiration, true, time.Minute, []string{"JobsDB.cacheExpiration"}...)
-	useJoinForUnprocessed = config.GetBool("JobsDB.useJoinForUnprocessed", true)
@atzoum commented Oct 17, 2022:

Explain: removing this config option since we always use a join for unprocessed jobs

@@ -17,14 +17,15 @@ type Options struct {
 }

 // LoadOptions loads application's initialisation options based on command line flags and environment
-func LoadOptions() *Options {
+func LoadOptions(args []string) *Options {
@atzoum commented Oct 17, 2022:

Explain: adding support for calling Run multiple times within the same test
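
A hedged example of what this enables in a test (the test body and import path are hypothetical; only the new LoadOptions signature comes from the diff):

import (
	"testing"

	"github.com/rudderlabs/rudder-server/app" // assumed import path
)

// Passing an explicit argument slice instead of reading the process-global
// os.Args lets a single test process build options (and call Run) twice.
func TestRunTwice(t *testing.T) {
	first := app.LoadOptions([]string{"rudder-server"})
	second := app.LoadOptions([]string{"rudder-server"})
	if first == nil || second == nil {
		t.Fatal("expected options to load from explicit args")
	}
}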


type MultiTenantLegacy struct {
	*HandleT
}

func (mj *MultiTenantLegacy) GetAllJobs(ctx context.Context, workspaceCount map[string]int, params GetQueryParamsT, _ int) ([]*JobT, error) { // skipcq: CRT-P0003
type legacyMoreToken struct {
@atzoum commented Oct 17, 2022:

Explain: for the legacy query we need to keep track of each sub-query's latest job ID, thus we use a different type of MoreToken
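
To make this concrete, a plausible shape for such a token (sketch only; the field names are assumptions, not the actual struct):

// legacyMoreToken remembers the last job ID returned by each sub-query the
// legacy union performs, so a follow-up call can resume every sub-query
// with its own AfterJobID condition. Field names are illustrative.
type legacyMoreToken struct {
	retryAfterJobID       *int64
	unprocessedAfterJobID *int64
}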

type (
	MoreToken interface{}
	moreToken struct {
		afterJobIDs map[string]*int64
@atzoum commented Oct 17, 2022:

Explain: for the fair pickup algorithm we need to keep track of the latest job ID per workspace
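
Schematically, a follow-up query consumes the token like this (a sketch, not the actual code; only params.AfterJobID appears in the diff below):

// Resume each workspace's sub-query after the last job ID it returned.
if after, ok := token.afterJobIDs[workspaceID]; ok && after != nil {
	params.AfterJobID = after // fetch only jobs with job_id > *after
}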

jd.markClearEmptyResult(ds, allWorkspaces, stateFilters, customValFilters, parameterFilters, willTryToSet, nil)

var stateQuery, customValQuery, limitQuery, sourceQuery string
skipCacheResult := params.AfterJobID != nil
@atzoum commented Oct 17, 2022:

Explain: we don't update the cache if the query contains an AfterJobID parameter

@codecov codecov bot commented Oct 17, 2022

Codecov Report

Base: 43.15% // Head: 44.63% // Increases project coverage by +1.47% 🎉

Coverage data is based on head (f414ca7) compared to base (79e3e34).
Patch coverage: 89.91% of modified lines in pull request are covered.

❗ Current head f414ca7 differs from pull request most recent head 7af8db7. Consider uploading reports for the commit 7af8db7 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2379      +/-   ##
==========================================
+ Coverage   43.15%   44.63%   +1.47%     
==========================================
  Files         186      188       +2     
  Lines       40020    39148     -872     
==========================================
+ Hits        17272    17475     +203     
+ Misses      21642    20581    -1061     
+ Partials     1106     1092      -14     
Impacted Files Coverage Δ
services/multitenant/legacy.go 0.00% <0.00%> (ø)
services/multitenant/noop.go 0.00% <0.00%> (ø)
jobsdb/unionQueryLegacy.go 62.29% <79.31%> (+5.15%) ⬆️
jobsdb/unionQuery.go 81.69% <84.72%> (-2.69%) ⬇️
router/router.go 74.83% <89.39%> (+7.10%) ⬆️
jobsdb/jobsdb.go 68.80% <91.13%> (+3.36%) ⬆️
router/internal/jobiterator/jobiterator.go 100.00% <100.00%> (ø)
services/multitenant/tenantstats.go 82.77% <100.00%> (-0.23%) ⬇️
services/streammanager/kafka/client/consumer.go 75.92% <0.00%> (-3.71%) ⬇️
... and 25 more


☔ View full report at Codecov.

@atzoum atzoum marked this pull request as ready for review October 17, 2022 15:33
@atzoum atzoum force-pushed the feat.routerDeepPickup branch 4 times, most recently from 1c77cb7 to edbabb0 on October 18, 2022 08:57
 func main() {
 	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
-	exitCode := Run(ctx)
+	r := runner.New(runner.ReleaseInfo{
@atzoum commented Oct 18, 2022:

Explain: moved the running logic away from main to the runner package, so that we can move integration test files into the packages they belong to.

@rudderlabs rudderlabs deleted a comment from github-actions bot Oct 18, 2022
@rudderlabs rudderlabs deleted a comment from github-actions bot Oct 18, 2022
@atzoum atzoum force-pushed the feat.routerDeepPickup branch 2 times, most recently from 3376cd2 to 2a744c6 on October 18, 2022 14:04
@atzoum atzoum marked this pull request as draft October 18, 2022 14:21
@atzoum atzoum changed the title [WIP] feat(router): avoid worker starvation during job pickup feat(router): avoid worker starvation during job pickup Oct 18, 2022
@atzoum atzoum force-pushed the feat.routerDeepPickup branch 2 times, most recently from 8fa22d9 to 6a4074e on October 19, 2022 11:25
@Sidddddarth (Member) commented:
LGTM.
The only reservation I have is resetting the pickupMap in iterator.HasNext() between successive iterations, because this way we'd only be picking up the discarded number of jobs in further iterations.
In case we don't find as many jobs in the first iteration and several more are stored afterwards (enough to fill the pickupMap counts), we'd end up not picking them up soon enough.

@atzoum commented Oct 28, 2022:

> Because this way we'd only be picking up the discarded number of jobs in further iterations.

This behaviour is intentional. The iterator only compensates for discarded jobs to avoid ending up with an unproductive pickup loop.

> In case we don't find as many jobs in the first iteration and several more are stored afterwards (enough to fill the pickupMap counts), we'd end up not picking them up soon enough.

After an iterator completes, the next pickup loop will create a fresh iterator with fresh limits. The time between two pickup loops should remain small enough and shouldn't cause any significant delays in picking up newly arrived jobs.
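
Schematically (names are illustrative, not the actual router code):

// Each pickup loop builds a brand-new iterator with fresh limits, so jobs
// stored after the previous loop are picked up by the next one; within a
// single loop the iterator only compensates for discarded jobs.
for {
	it := newIterator(freshPickupLimits(), fetchFromJobsDB)
	for it.HasNext() {
		assignToWorker(it.Next())
	}
	time.Sleep(loopSleep) // the time between two pickup loops stays small
}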

@Sidddddarth (Member) commented:
One other thought is that since we query by job_id as well now, we could use

type dataSetRangeT struct {
	minJobID  int64
	maxJobID  int64
	startTime int64
	endTime   int64
	ds        dataSetT
}

to check whether or not to query a DS (except the last DS).

@atzoum commented Oct 29, 2022:

> One other thought is that since we query by job_id as well now, we could use
>
> type dataSetRangeT struct {
> 	minJobID  int64
> 	maxJobID  int64
> 	startTime int64
> 	endTime   int64
> 	ds        dataSetT
> }
>
> to check whether or not to query a DS (except the last DS).

Yes, this would be an option; however, since job_id is an indexed column, the SQL query itself should be sufficiently quick and lightweight, and should handle the absence of results due to the job_id condition equally efficiently.

Thus I would argue that we can start without this special optimisation branch in our codebase.
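
For illustration, the kind of predicate the AfterJobID parameter adds (schematic, not the generated SQL or the actual Go code; ds.JobTable and the surrounding variables are assumed):

// job_id is an indexed column, so this predicate is resolved via the index
// and returns quickly even when no rows match.
query := fmt.Sprintf(
	`SELECT * FROM %q WHERE job_id > $1 ORDER BY job_id ASC LIMIT $2`,
	ds.JobTable)
rows, err := tx.QueryContext(ctx, query, *params.AfterJobID, limit)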

@Sidddddarth (Member) commented:
We'd also want to have a memory limit on how the jobIterator fetches jobs, right?
Every iteration it could fetch many jobs and discard them, and all these jobs would stay in memory until they get garbage collected.

@atzoum commented Oct 31, 2022:

> until they get garbage collected.

Since discarded jobs are eligible for garbage collection, we can rely on exactly that: if memory resources become scarce, they will get garbage collected.

@chandumlg (Member) commented:

Awesome work! 🎉
