
Add debug logs for terminated, scheduled workflow executor #92

Merged: 2 commits, Apr 17, 2020

Conversation

@katrogan (Contributor) commented Apr 16, 2020

TL;DR

This change adds logging and optional reconnect attempts for the scheduled workflow executor to address flyteorg/flyte#198

This is step 1 in diagnosing the problem (it's difficult to reproduce locally). Step 2 is adding behavior to handle the observed failure causes :)
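As a rough sketch of the shape of the change, assuming a hypothetical executor interface and helper name (runWithReconnects is illustrative, not the actual function in the diff; see the excerpts below for the real code):

    package sketch

    import (
        "log"
        "time"
    )

    // workflowExecutor stands in for the scheduled workflow executor;
    // Run blocks until the underlying subscriber terminates.
    type workflowExecutor interface {
        Run()
    }

    // runWithReconnects logs the start, runs the executor, and if Run ever
    // returns (i.e. the subscriber died) retries up to maxReconnectAttempts
    // times, sleeping between attempts. Zero attempts means run-once.
    func runWithReconnects(exec workflowExecutor, maxReconnectAttempts int, delay time.Duration) {
        log.Println("Starting the scheduled workflow executor")
        exec.Run()
        for attempt := 0; attempt < maxReconnectAttempts; attempt++ {
            log.Printf("Executor terminated; reconnect attempt %d of %d", attempt+1, maxReconnectAttempts)
            time.Sleep(delay)
            exec.Run()
        }
        log.Println("Scheduled workflow executor stopped; reconnect attempts exhausted")
    }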

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

How did you fix the bug, implement the feature, etc.? Link to any design docs, etc.

Tracking Issue

flyteorg/flyte#198

Follow-up issue

flyteorg/flyte#88

@katrogan (Contributor, Author)

cc @rstanevich

maxReconnectAttempts := configuration.ApplicationConfiguration().GetSchedulerConfig().
    WorkflowExecutorConfig.ReconnectAttempts
for reconnectAttempt := 0; reconnectAttempt < maxReconnectAttempts; reconnectAttempt++ {
    time.Sleep(workflowExecutorReconnectDelay)
Contributor

Why the delay? Do we need to add jitter?

Contributor Author

Yes, but it's just an (unproven) idea. I have no idea what's causing the client connection to terminate, so I figured it might be useful. If you disagree I can remove it altogether.
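(For reference, if jitter does turn out to matter, e.g. so several replicas don't all reconnect in lockstep, one common pattern is to randomize the fixed delay. A minimal sketch, illustrative only and not part of this PR:)

    package sketch

    import (
        "math/rand"
        "time"
    )

    // jitteredSleep sleeps a uniformly random duration in [base/2, base*3/2)
    // instead of exactly base, spreading out simultaneous reconnects.
    // Assumes base > 0; rand.Int63n panics on a non-positive argument.
    func jitteredSleep(base time.Duration) {
        jitter := time.Duration(rand.Int63n(int64(base)))
        time.Sleep(base/2 + jitter)
    }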

Contributor

Does it connect to the pub/sub on AWS SQS?

Contributor Author
yeah

@@ -50,6 +51,8 @@ func (m *AdminService) interceptPanic(ctx context.Context, request proto.Message

const defaultRetries = 3

var workflowExecutorReconnectDelay = 30 * time.Second
Contributor
If we are making the number of reconnect attempts configurable, should we make this too?

Contributor Author
sure
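(A sketch of what a configurable delay could look like next to the existing attempts setting. Only ReconnectAttempts is confirmed by this diff; the ReconnectDelaySeconds field and its key are hypothetical:)

    package sketch

    import "time"

    // WorkflowExecutorConfig sketches the scheduler's executor config.
    // ReconnectDelaySeconds is the hypothetical addition under discussion;
    // it falls back to the currently hardcoded 30s default.
    type WorkflowExecutorConfig struct {
        ReconnectAttempts     int `json:"reconnectAttempts"`
        ReconnectDelaySeconds int `json:"reconnectDelaySeconds"`
    }

    func (c WorkflowExecutorConfig) ReconnectDelay() time.Duration {
        if c.ReconnectDelaySeconds <= 0 {
            return 30 * time.Second
        }
        return time.Duration(c.ReconnectDelaySeconds) * time.Second
    }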

@@ -135,8 +138,17 @@ func NewAdminServer(kubeConfig, master string) *AdminService {
    scheduledWorkflowExecutor := workflowScheduler.GetWorkflowExecutor(executionManager, launchPlanManager)
    logger.Info(context.Background(), "Successfully initialized a new scheduled workflow executor")
    go func() {
        logger.Info(context.Background(), "Starting the scheduled workflow executor")
        scheduledWorkflowExecutor.Run()
Contributor

Do you want to remove this one?

Contributor Author

I wanted to support the default where no reconnect attempts are specified, but I can remove it if you think that's cleaner.
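(To spell out that default: with the unconditional first Run() kept, ReconnectAttempts = 0 reproduces the old behavior exactly, because the retry loop body never executes. A fragment using the identifiers from the diff above:)

    // With maxReconnectAttempts == 0 this reduces to the pre-change code:
    // a single blocking Run() and no retries.
    scheduledWorkflowExecutor.Run()
    for reconnectAttempt := 0; reconnectAttempt < maxReconnectAttempts; reconnectAttempt++ {
        time.Sleep(workflowExecutorReconnectDelay)
        scheduledWorkflowExecutor.Run()
    }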

@katrogan (Contributor, Author)

PTAL @EngHabu

@codecov-io

Codecov Report

❗ No coverage uploaded for pull request base (master@7d179a7).
The diff coverage is n/a.


@@            Coverage Diff            @@
##             master      #92   +/-   ##
=========================================
  Coverage          ?   63.00%           
=========================================
  Files             ?      100           
  Lines             ?     7007           
  Branches          ?        0           
=========================================
  Hits              ?     4415           
  Misses            ?     2086           
  Partials          ?      506           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d179a7...7c209eb.

@rstanevich

@katrogan
Thank you a lot for this fix! For more than a month now we've had no interruptions with scheduled workflows :)

There are a few errors from the Gizmo client, like:

{"json":{"src":"workflow_executor.go:246"},"level":"error","msg":"Gizmo subscriber closed channel with err: [RequestError: send request failed\ncaused by: Post https://sqs.us-east-1.amazonaws.com/: read tcp read: connection reset by peer]","ts":"2020-05-26T15:36:50Z"}

But the subscriber restarts gracefully, as far as I can see.
