This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Update async retry mechanisms #106

Merged: 8 commits merged into master on Jul 7, 2020

Conversation

@katrogan (Contributor) commented Jul 6, 2020

TL;DR

Add retries to the notifications processor to handle underlying SQS connection failures.

In general, add user-configurable retries for initializing the AWS/GCP client, but otherwise retry indefinitely when an open channel hiccups.
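
A minimal Go sketch of this two-tier policy, under stated assumptions: `newSQSClient` and `process` are illustrative stand-ins and the values are placeholders, not the actual flyteadmin code.

```go
// Hypothetical sketch of the two-tier retry described above; newSQSClient and
// process are illustrative stand-ins, not the actual flyteadmin functions.
package main

import (
	"errors"
	"fmt"
	"time"
)

func newSQSClient() (string, error) { return "sqs-client", nil } // real code dials AWS/GCP here
func process(client string) error   { return errors.New("channel closed") }

func main() {
	reconnectAttempts := 5             // bounded, comes from user config
	reconnectDelay := 30 * time.Second // also config-driven

	// Phase 1: bounded retries while creating the client.
	var client string
	var err error
	for attempt := 0; attempt <= reconnectAttempts; attempt++ {
		if client, err = newSQSClient(); err == nil {
			break
		}
		time.Sleep(reconnectDelay)
	}
	if err != nil {
		fmt.Printf("giving up on client init after %d attempts: %v\n", reconnectAttempts, err)
		return
	}

	// Phase 2: unbounded retries once the client exists; if the open channel
	// hiccups, log and start processing again after the delay.
	for {
		if err := process(client); err != nil {
			fmt.Printf("processor stopped with err [%v], restarting\n", err)
		}
		time.Sleep(reconnectDelay)
	}
}
```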

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

How did you fix the bug, make the feature etc. Link to any design docs etc

Tracking Issue

flyteorg/flyte#376

Follow-up issue

NA

@@ -98,11 +98,18 @@ func NewAdminServer(kubeConfig, master string) *AdminService {
publisher := notifications.NewNotificationsPublisher(*configuration.ApplicationConfiguration().GetNotificationsConfig(), adminScope)
processor := notifications.NewNotificationsProcessor(*configuration.ApplicationConfiguration().GetNotificationsConfig(), adminScope)
go func() {
logger.Info(context.Background(), "Started processing notifications.")
err = processor.StartProcessing()
Contributor

Shouldn't StartProcessing auto-handle connection loss and reconnect failures? Of course, it should log.

Contributor Author

sure, refactored @kumare3

@codecov-commenter commented Jul 6, 2020

Codecov Report

Merging #106 into master will decrease coverage by 0.26%.
The diff coverage is 24.13%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #106      +/-   ##
==========================================
- Coverage   62.87%   62.60%   -0.27%     
==========================================
  Files         101      102       +1     
  Lines        7369     7418      +49     
==========================================
+ Hits         4633     4644      +11     
- Misses       2191     2228      +37     
- Partials      545      546       +1     
Flag         Coverage Δ
#unittests   62.60% <24.13%> (-0.27%) ⬇️

Impacted Files                                            Coverage Δ
pkg/async/notifications/factory.go                        0.00% <0.00%> (ø)
...otifications/implementations/noop_notifications.go     0.00% <0.00%> (ø)
...g/async/notifications/implementations/processor.go     71.60% <14.28%> (-5.73%) ⬇️
pkg/async/schedule/aws/workflow_executor.go               63.33% <15.78%> (-5.07%) ⬇️
pkg/async/shared.go                                       100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e6addd4...71d4efd.

@kumare3 kumare3 self-requested a review July 6, 2020 21:59
@kumare3 (Contributor) commented Jul 6, 2020

Looks good to me, just one thing: in the case of a connection error, should the FlyteAdmin process not crash/exit?

@katrogan (Contributor, Author) commented Jul 6, 2020

I'm not sure - is a running admin with partial functionality better or should we fail fast?

@kumare3 (Contributor) commented Jul 6, 2020

> I'm not sure - is a running admin with partial functionality better or should we fail fast?

IMO, it could be a network partition or some other issue. If we fail fast, we might be able to recover on a different node. Also, since we have multiple replicas, the others should be fine.

On the other hand, in sandbox mode we should probably not crash, as users may not be able to figure out what the problem is. I pinged Ruslan. I guess if we do it for this, we should probably do it for all other cases?

@katrogan (Contributor, Author) commented Jul 6, 2020

Updated so that we retry initializing the AWS/GCP client the number of times specified in the config, but otherwise retry indefinitely if an open channel hiccups.

@katrogan katrogan changed the title Add retries to notifications processor Update async retry mechanisms Jul 6, 2020
@@ -2,6 +2,9 @@ package notifications

import (
"context"
"time"

"github.com/lyft/flyteadmin/pkg/async"
Contributor

looks like a future flytestdlib entity
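
The new pkg/async package imported here shows up in the coverage report above as pkg/async/shared.go (100% covered), so it presumably holds a small shared retry helper. A minimal sketch of what such a helper might look like, with the name and signature assumed rather than taken from the PR diff:

```go
// A sketch of the kind of shared helper pkg/async/shared.go could hold; the
// name and signature here are assumptions, not taken from the PR diff.
package async

import "time"

// Retry invokes f up to attempts times, sleeping delay between failed
// attempts, and returns the last error if every attempt fails.
func Retry(attempts int, delay time.Duration, f func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = f(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return err
}
```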

@@ -37,11 +40,18 @@ type EmailerConfig struct {
BaseURL string
}

func GetEmailer(config runtimeInterfaces.NotificationsConfig, scope promutils.Scope) interfaces.Emailer {
func GetEmailer(config runtimeInterfaces.NotificationsConfig, scope promutils.Scope,
reconnectAttempts int, reconnectDelay time.Duration) interfaces.Emailer {
switch config.Type {
case common.AWS:
awsConfig := aws.NewConfig().WithRegion(config.Region).WithMaxRetries(maxRetries)
Contributor

It seems the AWS config already takes retries and a delay?

Contributor Author

derp, thanks for the catch. updated
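
For context on the catch above: the aws-sdk-go config does expose a bounded retry knob of its own via WithMaxRetries, which is what the diff already uses. A hedged sketch of the client construction, with the region, retry count, and the choice of SES as the emailer backend all taken as placeholders:

```go
// Sketch of client construction with the SDK's built-in retry count; the
// region, retry value, and the choice of SES are placeholders.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ses"
)

func main() {
	// WithMaxRetries already bounds the SDK's own request retries; a separate
	// reconnect delay still has to be applied by the caller between attempts.
	awsConfig := aws.NewConfig().WithRegion("us-east-1").WithMaxRetries(3)

	sess, err := session.NewSession(awsConfig)
	if err != nil {
		panic(err)
	}

	sesClient := ses.New(sess)
	fmt.Printf("created SES client: %T\n", sesClient)
}
```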

kumare3 previously approved these changes Jul 7, 2020

@kumare3 (Contributor) left a comment

lgtm, thank you

for {
logger.Warningf(context.Background(), "Starting notifications processor")
err := p.run()
logger.Errorf(context.Background(), "error with running processor err: [%v] ", err)
Contributor

should we log a metric here too?

Contributor Author

we already log a metric when the channel closes
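
For reference, a hedged illustration of the sort of counter meant here, written against the Prometheus client directly; the metric name is invented, and the real code registers through the processor's promutils scope rather than like this.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// channelClosedCount is an illustrative counter; the real metric is registered
// through the processor's promutils scope, not directly like this.
var channelClosedCount = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "notifications_channel_closed_total",
	Help: "count of times the subscriber channel closed unexpectedly",
})

func main() {
	prometheus.MustRegister(channelClosedCount)

	msgChan := make(chan string)
	close(msgChan) // simulate the open channel hiccuping

	if _, ok := <-msgChan; !ok {
		channelClosedCount.Inc() // bump the metric before the outer loop restarts run()
		fmt.Println("channel closed, metric incremented")
	}
}
```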

@@ -168,6 +171,15 @@ func (e *workflowExecutor) formulateExecutionCreateRequest(
}

func (e *workflowExecutor) Run() {
for {
logger.Warningf(context.Background(), "Starting workflow executor")
err := e.run()
Contributor

and here

Contributor

and this is a layer above the async auto-retry layer right? so if we exhaust all retries, then we chill for half an hour and then try again?

Contributor Author

wow half an hour is a goof, meant it to be half a minute - will update

Contributor Author

added a metric on channel close

@wild-endeavor (Contributor) commented Jul 7, 2020

"async retry is for initializing clients a finite amount of times, your comment is when the channel hiccups"

// Number of times to attempt recreating a notifications processor client should there be any disruptions.
ReconnectAttempts int `json:"reconnectAttempts"`
// Specifies the time interval to wait before attempting to reconnect the notifications processor client.
ReconnectDelaySeconds int `json:"reconnectDelaySeconds"`
Contributor Author

Sure, but it's a question of how much granularity we need to expose - are people really going to configure retry delays on the order of many minutes or even hours?
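
Given the two fields quoted above, the wiring is presumably just a one-time conversion of the seconds value to a time.Duration before it reaches the retry logic. A small assumed sketch, with the struct name chosen purely for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// NotificationsProcessorConfig mirrors the two fields quoted above; the struct
// name here is illustrative, not necessarily the one used in flyteadmin.
type NotificationsProcessorConfig struct {
	ReconnectAttempts     int `json:"reconnectAttempts"`
	ReconnectDelaySeconds int `json:"reconnectDelaySeconds"`
}

func main() {
	cfg := NotificationsProcessorConfig{ReconnectAttempts: 5, ReconnectDelaySeconds: 30}

	// Keeping the config in whole seconds keeps the exposed granularity coarse;
	// the code converts once to a time.Duration for the retry helper.
	reconnectDelay := time.Duration(cfg.ReconnectDelaySeconds) * time.Second
	fmt.Println(cfg.ReconnectAttempts, reconnectDelay)
}
```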

@katrogan katrogan merged commit 6d15a2b into master Jul 7, 2020
eapolinario pushed a commit that referenced this pull request Sep 6, 2023