Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SocketTimeoutExceptions from WebSockets connections isn't handled #302

Open
gnarea opened this issue Oct 3, 2023 · 10 comments
Open

SocketTimeoutExceptions from WebSockets connections isn't handled #302

gnarea opened this issue Oct 3, 2023 · 10 comments
Assignees

Comments

@gnarea
Copy link
Member

gnarea commented Oct 3, 2023

The following doesn't appear to have any effect (SocketTimeoutException is a subclass of IOException):

} catch (exc: IOException) {
logger.info { "Server is unreachable (${exc.message})" }
throw ServerConnectionException("Server is unreachable", exc)

Because Letro is crashing when the connection to the Awala server stops responding to pings:

E  FATAL EXCEPTION: AwalaManagerMessageReceiverThread
Process: tech.relaycorp.letro, PID: 14200
java.net.SocketTimeoutException: sent ping but didn't receive pong within 5000ms (after 1 successful ping/pongs)
	at okhttp3.internal.ws.RealWebSocket.writePingFrame$okhttp(RealWebSocket.kt:563)
	at okhttp3.internal.ws.RealWebSocket$initReaderAndWriter$lambda-3$$inlined$schedule$1.runOnce(TaskQueue.kt:219)
	at okhttp3.internal.concurrent.TaskRunner.runTask(TaskRunner.kt:116)
	at okhttp3.internal.concurrent.TaskRunner.access$runTask(TaskRunner.kt:42)
	at okhttp3.internal.concurrent.TaskRunner$runnable$1.run(TaskRunner.kt:65)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
	at java.lang.Thread.run(Thread.java:923)
	Suppressed: kotlinx.coroutines.internal.DiagnosticCoroutineContextException: [StandaloneCoroutine{Cancelling}@a67a425, java.util.concurrent.ScheduledThreadPoolExecutor@bc259fa[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 4]]
@sdsantos
Copy link
Collaborator

From what I reproduced, it was actually working and wrapping in a ServerConnectionException. But now this requirements changed, since we added separate handling for SocketTimeoutException.

@gnarea
Copy link
Member Author

gnarea commented Oct 11, 2023

Indeed. Though it's interesting that wsConnect() isn't in the stack trace, so I guess that's why we got the uncaught exception. Maybe it's something we need to handle at the flow level?

@sdsantos
Copy link
Collaborator

Yup, it should be something like that. I'll investigate tomorrow.

@sdsantos
Copy link
Collaborator

From what I can tell it's being handled at the Flow level. Any steps to reproduce this issue? I haven't been able to.

I also noticed that the PDC interface is outdated. These throws come from inside the Flow, not from the method call: https://github.com/relaycorp/awala-jvm/blob/bb426493379c605eec0a0c628f8c1dab82f5f6ad/src/main/kotlin/tech/relaycorp/relaynet/bindings/pdc/PDCClient.kt#L54

@gnarea
Copy link
Member Author

gnarea commented Oct 12, 2023

From what I can tell it's being handled at the Flow level. Any steps to reproduce this issue? I haven't been able to.

Handled at the flow level by whom?

Not even disconnecting the router from the Internet did the trick?

I haven't experienced this myself. @mgulyaev10, any idea on how to reproduce this? I suspect it may be intermittent and happen rarely.

I also noticed that the PDC interface is outdated. These throws come from inside the Flow, not from the method call: https://github.com/relaycorp/awala-jvm/blob/bb426493379c605eec0a0c628f8c1dab82f5f6ad/src/main/kotlin/tech/relaycorp/relaynet/bindings/pdc/PDCClient.kt#L54

Thanks! Fixed in relaycorp/awala-jvm#300

@sdsantos
Copy link
Collaborator

From what I can tell it's being handled at the Flow level. Any steps to reproduce this issue? I haven't been able to.

Handled at the flow level by whom?

Not even disconnecting the router from the Internet did the trick?

That only triggered a wrapped ServerConnectionException that is being handled.

@sdsantos
Copy link
Collaborator

Looks like KTOR as an history of uncatchable exceptions that don't reference the app source code. Here's at least once that mentions SocketTimeoutException, although with a different stacktrace: https://youtrack.jetbrains.com/issue/KTOR-577/java.net.SocketTimeoutException-with-no-lines-referencing-my-code

Since our stacktraces also don't mention our own source code, we might not be able to handle the exception regularly. Maybe we can fix the underlying issue causing the timeout, which might not be possible. Other solutions look to be pretty last resort like CoroutineExceptionHandler or Thread.setDefaultUncaughtExceptionHandler, which don't sound great.

@gnarea
Copy link
Member Author

gnarea commented Dec 14, 2023

Thanks for looking into this @sdsantos!

That doesn't look good :(

I guess even if we wanted to use CoroutineExceptionHandler or Thread.setDefaultUncaughtExceptionHandler, we still wouldn't know for sure if that SocketTimeoutException came from our Ktor client anyway, so it wouldn't really work.

@sdsantos
Copy link
Collaborator

I guess even if we wanted to use CoroutineExceptionHandler or Thread.setDefaultUncaughtExceptionHandler, we still wouldn't know for sure if that SocketTimeoutException came from our Ktor client anyway, so it wouldn't really work.

At least with CoroutineExceptionHandler you can wrap it around a specific part of the code. But it could be catching more than we want yes.

@gnarea
Copy link
Member Author

gnarea commented Dec 14, 2023

That's a bit better then, but I'd be concerned with the code complexity to restart/end the WebSocket connection when that happens.

Since this has only happened once in Letro in the past 90 days, let's put a pin on this for now and revisit if it happens more often. Hopefully that'll buy enough time until Jet Brains fix this.

@gnarea gnarea removed the status in Android maintenance Dec 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

2 participants