-
Notifications
You must be signed in to change notification settings - Fork 63
Add debug logs for terminated, scheduled workflow executor #92
Conversation
cc @rstanevich |
pkg/rpc/adminservice/base.go
Outdated
maxReconnectAttempts := configuration.ApplicationConfiguration().GetSchedulerConfig(). | ||
WorkflowExecutorConfig.ReconnectAttempts | ||
for reconnectAttempt := 0; reconnectAttempt < maxReconnectAttempts; reconnectAttempt++ { | ||
time.Sleep(workflowExecutorReconnectDelay) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the delay? do we need to add Jitter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but it's just an (unproven) idea. i have no idea what's causing the client connection to terminate so i figured it might be useful. if you disagree i can remove altogether
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it connect to the pubsub, on AWS SQS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah
pkg/rpc/adminservice/base.go
Outdated
@@ -50,6 +51,8 @@ func (m *AdminService) interceptPanic(ctx context.Context, request proto.Message | |||
|
|||
const defaultRetries = 3 | |||
|
|||
var workflowExecutorReconnectDelay = 30 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are making the number of reconnect attempts configurable, should we make this too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
@@ -135,8 +138,17 @@ func NewAdminServer(kubeConfig, master string) *AdminService { | |||
scheduledWorkflowExecutor := workflowScheduler.GetWorkflowExecutor(executionManager, launchPlanManager) | |||
logger.Info(context.Background(), "Successfully initialized a new scheduled workflow executor") | |||
go func() { | |||
logger.Info(context.Background(), "Starting the scheduled workflow executor") | |||
scheduledWorkflowExecutor.Run() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you want to remove this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to support the default where no reconnect attempts are specified, but I can if you think this is cleaner
PTAL @EngHabu |
Codecov Report
@@ Coverage Diff @@
## master #92 +/- ##
=========================================
Coverage ? 63.00%
=========================================
Files ? 100
Lines ? 7007
Branches ? 0
=========================================
Hits ? 4415
Misses ? 2086
Partials ? 506 Continue to review full report at Codecov.
|
@katrogan There are a few errors from Gizmo client like
But subscriber restarts gracefully as I see. |
TL;DR
This change adds logging and optional reconnect attempts for the scheduled workflow executor to address flyteorg/flyte#198
This is step 1 in diagnosing the problem (it's difficult to reproduce locally). Step 2 is adding behavior to handle the observed failure causes :)
Type
Are all requirements met?
Complete description
How did you fix the bug, make the feature etc. Link to any design docs etc
Tracking Issue
flyteorg/flyte#198
Follow-up issue
flyteorg/flyte#88