Long running Lambda e2e tests are failing #249

tillrohrmann · 2024-01-15T09:00:31Z

Our long-running Lambda e2e tests are currently failing according to our dev-alerts channel. After failing, they seem to recover after some time (and retries). @jackkleeman mentioned that pods are sometimes created without an internet connection, and that this is why the tests fail and then recover on a retry.

I think it would be great to solve this problem because it creates false positives and makes people take failing long-running tests less seriously than they should.

jackkleeman · 2024-01-15T11:43:38Z

Actually, I don't think the tests are failing. There are just pods restarting, which is a separate alert.

tillrohrmann · 2024-01-19T09:04:08Z

@jackkleeman I am seeing these e2e tests failing repeatedly. Any new ideas on how to fix the problem?

jackkleeman · 2024-01-19T15:49:27Z

@tillrohrmann the tests are not failing! It is just a restarting-pods alert. It's an alert I added a few days back, following your suggestion about detecting panics in Restate, but the tests succeed despite the restarts. I didn't have the bandwidth for this this week, but I can remove the alert if you like. Most likely the whole infrastructure is about to change, and it's not worth investigating the network partitions that are causing these alerts.

tillrohrmann · 2024-01-19T17:33:58Z

So because of the missing internet connection, the binary is panicking, and on restart it usually gets resolved? Or is there an easy way to distinguish between the "no internet connection" case and a "Restate panic"? Maybe a pragmatic solution would be to disable these alerts for the Lambda tests, where the internet connection problem can arise.

jackkleeman · 2024-01-20T09:20:33Z

It appears that the binary eventually exits, yes, and after some restarts the problem is resolved. And no, it is not possible to figure out what caused a Restate binary to restart without parsing logs.
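
For illustration only, here is a minimal sketch of what such log parsing could look like. It assumes the logs of the previous container are available as plain text and that a panic shows up in the format of Rust's default panic hook (`thread '<name>' panicked at ...`). Nothing in this thread confirms either assumption. The function name `classify_restart` and the network error substrings are hypothetical and would need to be adapted to what the logs really contain.

```python
import re

# Rust's default panic hook prints a line containing "thread '<name>' panicked at".
# Assumption: the binary does not replace this hook with a different format.
PANIC_RE = re.compile(r"thread '[^']*' panicked at")

# Hypothetical substrings that indicate a missing internet connection.
# These are illustrative guesses, not taken from real Restate logs.
NETWORK_MARKERS = (
    "dns error",
    "connection refused",
    "network is unreachable",
)


def classify_restart(log_text: str) -> str:
    """Classify a restart as 'network', 'panic', or 'unknown' from its logs."""
    lowered = log_text.lower()
    # Check for connectivity errors first: a panic that follows a network
    # error is treated as a symptom of the missing connection.
    if any(marker in lowered for marker in NETWORK_MARKERS):
        return "network"
    if PANIC_RE.search(log_text):
        return "panic"
    return "unknown"
```

An alert could then fire only for the `panic` and `unknown` cases, which would keep the panic detection while ignoring restarts caused by a missing internet connection.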

slinkydeveloper · 2024-02-13T14:24:02Z

Is this still relevant?

tillrohrmann · 2024-02-13T15:54:49Z

We haven't improved the situation yet. So I would say, it is still relevant.
