Long running Lambda e2e tests are failing #249
Comments
Actually I don't think the tests are failing; there are just pods restarting, which is a separate alert.
@jackkleeman I am seeing these e2e tests failing repeatedly. Any new ideas on how to fix the problem?
@tillrohrmann the tests are not failing! it is just a restarting-pods alert. it's an alert i added a few days back, following your suggestion about detecting panics in restate, but the tests succeed despite the restart. i didn't have the bandwidth for this this week, but i can remove the alert if you like. most likely the whole infrastructure is about to change, and it's not worth investigating the network partitions that are causing these alerts
So because of the missing internet connection, the binary panics, and the problem usually resolves itself on restart? Or is there an easy way to distinguish between the "no internet connection" case and a "Restate panic"? Maybe a pragmatic solution would be to disable these alerts for the lambda tests, where the internet connection problem can arise.
it appears that the binary eventually exits, yes, and after some restarts it is resolved. and no, it is not possible to figure out what caused a restate binary to restart without parsing logs
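As a rough illustration of what "parsing logs" could mean here, the sketch below classifies the log of a restarted pod. It relies on one known fact: Rust's default panic hook prints a line of the form `thread '<name>' panicked at ...`. Everything else is an assumption made for illustration: the function name `classify_restart`, the list of network error strings, and the rule that a network error anywhere in the log wins over a panic. None of this comes from the Restate code base or its actual log output.

```python
import re

# Rust's default panic hook prints "thread '<name>' panicked at ...".
PANIC_PATTERN = re.compile(r"thread '[^']*' panicked at")

# Hypothetical markers for connectivity problems; replace them with
# whatever the real logs contain.
NETWORK_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"network is unreachable",
        r"connection refused",
        r"dns error",
        r"timed out",
    )
]


def classify_restart(log_text: str) -> str:
    """Return "network", "panic", or "unknown" for a pod's log output.

    Assumption: a panic that follows a connectivity error is treated as a
    network problem, so network markers are checked first.
    """
    if any(pattern.search(log_text) for pattern in NETWORK_PATTERNS):
        return "network"
    if PANIC_PATTERN.search(log_text):
        return "panic"
    return "unknown"
```

An alert rule built on something like this could stay silent for "network" restarts in the lambda tests and only fire for "panic".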
is this still relevant?
We haven't improved the situation yet, so I would say it is still relevant.
Currently, our long running lambda e2e tests are failing according to our dev-alerts channel. After failing, they seem to recover after a bit of time (and retries). @jackkleeman mentioned that pods are sometimes created without an internet connection, and that this is why the tests fail and then recover on a retry.

I think it would be great to solve this problem because it creates false positives and makes people take failing long running tests less seriously than they should.