Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] [FlyteAdmin] Notifications SQS subscriber stops processing messages when "connection reset by peer" #376

Closed
2 of 20 tasks
rstanevich opened this issue Jul 3, 2020 · 8 comments
Assignees
Labels
bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers
Milestone

Comments

@rstanevich
Copy link
Contributor

rstanevich commented Jul 3, 2020

Describe the bug
Notifications SQS subscriber stopped process messages

Expected behavior
Gracefully reconnecting if the application is running

Flyte component

  • Overall
  • Flyte Setup and Installation scripts
  • Flyte Documentation
  • Flyte communication (slack/email etc)
  • FlytePropeller
  • FlyteIDL (Flyte specification language)
  • Flytekit (Python SDK)
  • FlyteAdmin (Control Plane service)
  • FlytePlugins
  • DataCatalog
  • FlyteStdlib (common libraries)
  • FlyteConsole (UI)
  • Other

To Reproduce
Steps to reproduce the behavior:

  1. Run flyte with enabled notifications
  2. Wait for this happens

Environment
Flyte component

  • Sandbox (local or on one machine)
  • Cloud hosted
    • AWS
    • GCP
    • Azure
  • Baremetal
  • Other

Additional context
Logs:

{"json":{"src":"base.go:103"},"level":"error","msg":"error with starting processor err: [RequestError: send request failed\ncaused by: Post https://sqs.us-east-1.amazonaws.com/: read tcp 10.200.8.116:59882-\u003e52.46.137.144:443: read: connection reset by peer] ","ts":"2020-07-03T10:35:10Z"}
{"json":{"src":"processor.go:113"},"level":"warning","msg":"The stream for the subscriber channel closed with err: RequestError: send request failed\ncaused by: Post https://sqs.us-east-1.amazonaws.com/: read tcp 10.200.8.116:59882-\u003e52.46.137.144:443: read: connection reset by peer","ts":"2020-07-03T10:35:10Z"}

I guess solution will be similar to this one: flyteorg/flyteadmin#92

@rstanevich rstanevich added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Jul 3, 2020
@kumare3
Copy link
Contributor

kumare3 commented Jul 5, 2020

@rstanevich thank you for opening up the bug, I will add it to the next milestone. Do you think thats ok, or is it affecting everyday and we should fix it ASAP?

@kumare3 kumare3 added this to the 0.6.0 milestone Jul 5, 2020
@katrogan
Copy link
Contributor

katrogan commented Jul 6, 2020

thanks for opening @rstanevich - is this a recent issue?

@rstanevich
Copy link
Contributor Author

thank you @kumare3 and @katrogan

Honestly, we noticed it first time about notifications SQS subscriber.
We are running flyteadmin in multiple replicas and we have alert for the oldest message in SQS. Then we can just restart flyteadmin pods.
So, it looks not urgent at least for us.

@kumare3
Copy link
Contributor

kumare3 commented Jul 6, 2020

thank you @kumare3 and @katrogan

Honestly, we noticed it first time about notifications SQS subscriber.
We are running flyteadmin in multiple replicas and we have alert for the oldest message in SQS. Then we can just restart flyteadmin pods.
So, it looks not urgent at least for us.

Thank you for the reply. BTW, we have Flyte's very own Flash @katrogan who already has a PR for this

@kumare3
Copy link
Contributor

kumare3 commented Jul 6, 2020

@rstanevich do you have an alert if FlyteAdmin crashes? What would be preferrable, that you notice a FlyteAdmin crash or it continues to limp along and you notice older messages?

@rstanevich
Copy link
Contributor Author

@kumare3
Yes, we have alerts on crashing any pods in cluster.
Do you mean restarting container if sqs subscriber stopped?
actually, I don't know what is better. It is pretty good now... maybe I think connection retry should be executed unlimited times
with some interval.

@kumare3
Copy link
Contributor

kumare3 commented Jul 7, 2020

@rstanevich I think we found a very good way of solving this problem. @katrogan will merge the PR soon. Thank you for raising the issue.

@kumare3
Copy link
Contributor

kumare3 commented Jul 7, 2020

it is merged and will be part of the next release

@kumare3 kumare3 closed this as completed Jul 7, 2020
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 6, 2022
* Better Error

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* lint

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* fix unit test

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Add defensive nil checks

Signed-off-by: Haytham Abuelfutuh <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 6, 2022
…#376)

* wip

Signed-off-by: Katrina Rogan <[email protected]>

* add a test too

Signed-off-by: Katrina Rogan <[email protected]>

* Matchable attribute impl

Signed-off-by: Katrina Rogan <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
* Add missing in_container.mk

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Fix serialization for greatexpectations

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* fix mnist classifier examples

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Update requirements for kfpytorch

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Update kfpytorch requirements

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Update sql requirements

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Enable pytorch sagemaker image build

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* tidy names

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Try to silence TERM errors

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Install libssl1.0.0

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Updates to pytorch images

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Try updates

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* cleanup

Signed-off-by: Haytham Abuelfutuh <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Aug 9, 2023
* Better Error

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* lint

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* fix unit test

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Add defensive nil checks

Signed-off-by: Haytham Abuelfutuh <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Aug 21, 2023
…#376)

* wip

Signed-off-by: Katrina Rogan <[email protected]>

* add a test too

Signed-off-by: Katrina Rogan <[email protected]>

* Matchable attribute impl

Signed-off-by: Katrina Rogan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers
Projects
None yet
Development

No branches or pull requests

3 participants