This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Update async retry mechanisms #106

Merged: 8 commits merged into master on Jul 7, 2020

Conversation

@katrogan (Contributor) commented Jul 6, 2020

TL;DR

Add retries to the notifications processor to handle underlying SQS connection failures.

In general, add user-configurable retries for initializing the AWS/GCP client, but otherwise retry indefinitely when an open channel hiccups.
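
A minimal Go sketch of this two-tier policy, under stated assumptions: `newSQSClient` and `process` are illustrative stand-ins and the values are placeholders, not the actual flyteadmin code.

```go
// Hypothetical sketch of the two-tier retry described above; newSQSClient and
// process are illustrative stand-ins, not the actual flyteadmin functions.
package main

import (
	"errors"
	"fmt"
	"time"
)

func newSQSClient() (string, error) { return "sqs-client", nil } // real code dials AWS/GCP here
func process(client string) error   { return errors.New("channel closed") }

func main() {
	reconnectAttempts := 5             // bounded, comes from user config
	reconnectDelay := 30 * time.Second // also config-driven

	// Phase 1: bounded retries while creating the client.
	var client string
	var err error
	for attempt := 0; attempt <= reconnectAttempts; attempt++ {
		if client, err = newSQSClient(); err == nil {
			break
		}
		time.Sleep(reconnectDelay)
	}
	if err != nil {
		fmt.Printf("giving up on client init after %d attempts: %v\n", reconnectAttempts, err)
		return
	}

	// Phase 2: unbounded retries once the client exists; if the open channel
	// hiccups, log and start processing again after the delay.
	for {
		if err := process(client); err != nil {
			fmt.Printf("processor stopped with err [%v], restarting\n", err)
		}
		time.Sleep(reconnectDelay)
	}
}
```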

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

How did you fix the bug, make the feature etc. Link to any design docs etc

Tracking Issue

flyteorg/flyte#376

Follow-up issue

NA

@@ -98,11 +98,18 @@ func NewAdminServer(kubeConfig, master string) *AdminService {
publisher := notifications.NewNotificationsPublisher(*configuration.ApplicationConfiguration().GetNotificationsConfig(), adminScope)
processor := notifications.NewNotificationsProcessor(*configuration.ApplicationConfiguration().GetNotificationsConfig(), adminScope)
go func() {
logger.Info(context.Background(), "Started processing notifications.")
err = processor.StartProcessing()
Contributor

Shouldn't StartProcessing auto-handle connection loss and reconnect failures? Of course, it should log.

Contributor Author

sure, refactored @kumare3

@codecov-commenter commented Jul 6, 2020

Codecov Report

Merging #106 into master will decrease coverage by 0.26%.
The diff coverage is 24.13%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #106      +/-   ##
==========================================
- Coverage   62.87%   62.60%   -0.27%     
==========================================
  Files         101      102       +1     
  Lines        7369     7418      +49     
==========================================
+ Hits         4633     4644      +11     
- Misses       2191     2228      +37     
- Partials      545      546       +1     
Flag         Coverage Δ
#unittests   62.60% <24.13%> (-0.27%) ⬇️

Impacted Files                                            Coverage Δ
pkg/async/notifications/factory.go                        0.00% <0.00%> (ø)
...otifications/implementations/noop_notifications.go     0.00% <0.00%> (ø)
...g/async/notifications/implementations/processor.go     71.60% <14.28%> (-5.73%) ⬇️
pkg/async/schedule/aws/workflow_executor.go               63.33% <15.78%> (-5.07%) ⬇️
pkg/async/shared.go                                       100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e6addd4...71d4efd.

@kumare3 kumare3 self-requested a review July 6, 2020 21:59
@kumare3 (Contributor) commented Jul 6, 2020

Looks good to me, just one thing: in the case of a connection error, should the FlyteAdmin process not crash/exit?

@katrogan (Contributor, Author) commented Jul 6, 2020

I'm not sure - is a running admin with partial functionality better or should we fail fast?

@kumare3 (Contributor) commented Jul 6, 2020

> I'm not sure - is a running admin with partial functionality better or should we fail fast?

IMO, it could be a network partition or some other issue. If we fail fast, we might be able to recover on a different node. Also, since we have multiple replicas, the others should be fine.

On the other hand, in sandbox mode we should probably not crash, as users may not be able to figure out what the problem is. I pinged Ruslan. I guess if we do it for this, we should probably do it for all other cases?

@katrogan (Contributor, Author) commented Jul 6, 2020

Updated so that we retry initializing the AWS/GCP client the number of times specified in the config, but otherwise retry indefinitely if an open channel hiccups.

@katrogan katrogan changed the title Add retries to notifications processor Update async retry mechanisms Jul 6, 2020
@@ -2,6 +2,9 @@ package notifications

import (
"context"
"time"

"github.com/lyft/flyteadmin/pkg/async"
Contributor

looks like a future flytestdlib entity
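
The new pkg/async package imported here shows up in the coverage report above as pkg/async/shared.go (100% covered), so it presumably holds a small shared retry helper. A minimal sketch of what such a helper might look like, with the name and signature assumed rather than taken from the PR diff:

```go
// A sketch of the kind of shared helper pkg/async/shared.go could hold; the
// name and signature here are assumptions, not taken from the PR diff.
package async

import "time"

// Retry invokes f up to attempts times, sleeping delay between failed
// attempts, and returns the last error if every attempt fails.
func Retry(attempts int, delay time.Duration, f func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = f(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return err
}
```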

@@ -37,11 +40,18 @@ type EmailerConfig struct {
BaseURL string
}

func GetEmailer(config runtimeInterfaces.NotificationsConfig, scope promutils.Scope) interfaces.Emailer {
func GetEmailer(config runtimeInterfaces.NotificationsConfig, scope promutils.Scope,
reconnectAttempts int, reconnectDelay time.Duration) interfaces.Emailer {
switch config.Type {
case common.AWS:
awsConfig := aws.NewConfig().WithRegion(config.Region).WithMaxRetries(maxRetries)
Contributor

It seems the AWS config already takes retries and a delay?

Contributor Author

derp, thanks for the catch. updated
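
For context on the catch above: the aws-sdk-go config does expose a bounded retry knob of its own via WithMaxRetries, which is what the diff already uses. A hedged sketch of the client construction, with the region, retry count, and the choice of SES as the emailer backend all taken as placeholders:

```go
// Sketch of client construction with the SDK's built-in retry count; the
// region, retry value, and the choice of SES are placeholders.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ses"
)

func main() {
	// WithMaxRetries already bounds the SDK's own request retries; a separate
	// reconnect delay still has to be applied by the caller between attempts.
	awsConfig := aws.NewConfig().WithRegion("us-east-1").WithMaxRetries(3)

	sess, err := session.NewSession(awsConfig)
	if err != nil {
		panic(err)
	}

	sesClient := ses.New(sess)
	fmt.Printf("created SES client: %T\n", sesClient)
}
```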

kumare3 previously approved these changes Jul 7, 2020

@kumare3 (Contributor) left a comment

lgtm, thank you

for {
logger.Warningf(context.Background(), "Starting notifications processor")
err := p.run()
logger.Errorf(context.Background(), "error with running processor err: [%v] ", err)
Contributor

should we log a metric here too?

Contributor Author

we already log a metric when the channel closes
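
For reference, a hedged illustration of the sort of counter meant here, written against the Prometheus client directly; the metric name is invented, and the real code registers through the processor's promutils scope rather than like this.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// channelClosedCount is an illustrative counter; the real metric is registered
// through the processor's promutils scope, not directly like this.
var channelClosedCount = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "notifications_channel_closed_total",
	Help: "count of times the subscriber channel closed unexpectedly",
})

func main() {
	prometheus.MustRegister(channelClosedCount)

	msgChan := make(chan string)
	close(msgChan) // simulate the open channel hiccuping

	if _, ok := <-msgChan; !ok {
		channelClosedCount.Inc() // bump the metric before the outer loop restarts run()
		fmt.Println("channel closed, metric incremented")
	}
}
```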

@@ -168,6 +171,15 @@ func (e *workflowExecutor) formulateExecutionCreateRequest(
}

func (e *workflowExecutor) Run() {
for {
logger.Warningf(context.Background(), "Starting workflow executor")
err := e.run()
Contributor

and here

Contributor

and this is a layer above the async auto-retry layer right? so if we exhaust all retries, then we chill for half an hour and then try again?

Contributor Author

wow half an hour is a goof, meant it to be half a minute - will update

Contributor Author

added a metric on channel close

@wild-endeavor (Contributor) commented Jul 7, 2020

"async retry is for initializing clients a finite amount of times, your comment is when the channel hiccups"

// Number of times to attempt recreating a notifications processor client should there be any disruptions.
ReconnectAttempts int `json:"reconnectAttempts"`
// Specifies the time interval to wait before attempting to reconnect the notifications processor client.
ReconnectDelaySeconds int `json:"reconnectDelaySeconds"`
Contributor Author

Sure, but it's a question of how much granularity we need to expose - are people really going to configure retry delays on the order of many minutes or even hours?
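
Given the two fields quoted above, the wiring is presumably just a one-time conversion of the seconds value to a time.Duration before it reaches the retry logic. A small assumed sketch, with the struct name chosen purely for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// NotificationsProcessorConfig mirrors the two fields quoted above; the struct
// name here is illustrative, not necessarily the one used in flyteadmin.
type NotificationsProcessorConfig struct {
	ReconnectAttempts     int `json:"reconnectAttempts"`
	ReconnectDelaySeconds int `json:"reconnectDelaySeconds"`
}

func main() {
	cfg := NotificationsProcessorConfig{ReconnectAttempts: 5, ReconnectDelaySeconds: 30}

	// Keeping the config in whole seconds keeps the exposed granularity coarse;
	// the code converts once to a time.Duration for the retry helper.
	reconnectDelay := time.Duration(cfg.ReconnectDelaySeconds) * time.Second
	fmt.Println(cfg.ReconnectAttempts, reconnectDelay)
}
```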

@katrogan katrogan merged commit 6d15a2b into master Jul 7, 2020
eapolinario pushed a commit that referenced this pull request Sep 6, 2023