Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fatal error in gRPC coroutines stack caused by "An error occurred while pushing a context" #3945

Closed
smax48 opened this issue Nov 23, 2021 · 16 comments

Comments

@smax48
Copy link

smax48 commented Nov 23, 2021

Today we had a very strange fatal crash in one of our server instances (hosted in ECS/Fargate).
We had to restart the instance, otherwise it kept crashing...

Our stack is Armeria 1.13.1 / kotlin 1.5.31; kotlin coroutines 1.5.2; gRPC kotlin 1.2.0; gRPC java 1.41.0;.
The gRPC service is based on kotlin coroutines with armeria blocking task executor.

Seems that the initial exception that was recorded in logs, was "An error occurred while pushing a context":

java.lang.IllegalStateException: Trying to call object wrapped with context [sreqId=e4462b39, chanId=72864b63, raddr=10.204.22.170:49513, laddr=10.204.21.148:8081][h2c://ip-10-204-21-148.ap-southeast-2.compute.internal/<service and operation name that failed>#POST], but context is currently set to [sreqId=5f01fe01, chanId=966cfccd, raddr=10.204.22.170:54644, laddr=10.204.21.148:8081][h1c://ip-10-204-21-148.ap-southeast-2.compute.internal/<full service and operation name>t#POST]. This means the callback was called from unexpected thread or forgetting to close previous context.\
at com.linecorp.armeria.internal.common.RequestContextUtil.newIllegalContextPushingException(RequestContextUtil.java:139)\
at com.linecorp.armeria.server.ServiceRequestContext.push(ServiceRequestContext.java:234)\
at com.linecorp.armeria.common.RequestContext.lambda$makeContextAware$3(RequestContext.java:546)\
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)\
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)\
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)\
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\
at java.base/java.lang.Thread.run(Unknown Source)\

After that, we had hundreds of coroutine-related exceptions - all caused by the same wrong context (sreqId=5f01fe01):

CoroutinesInternalError
Fatal exception in coroutines machinery for DispatchedContinuation[com.linecorp.armeria.common.util.DefaultBlockingTaskExecutor@34e9f91f, Continuation at io.grpc.kotlin.ServerCalls$serverCallListener$1.invokeSuspend(ServerCalls.kt)@61c5d0c6]. Please read KDoc to 'handleFatalException' method and report this incident to maintainers

....

IllegalStateException
Trying to call object wrapped with context [sreqId=067cd880, chanId=217d479a, raddr=10.204.21.9:22674, laddr=10.204.21.148:8081][h2c://ip-10-204-21-148.ap-southeast-2.compute.internal/<service and operation name>#POST], but context is currently set to [sreqId=5f01fe01, chanId=966cfccd, raddr=10.204.22.170:54644, laddr=10.204.21.148:8081][h1c://ip-10-204-21-148.ap-southeast-2.compute.internal/<the same original service and operation name>#POST]. This means the callback was called from unexpected thread or forgetting to close previous context.

com.linecorp.armeria.internal.common.RequestContextUtil in newIllegalContextPushingException at line 139
com.linecorp.armeria.server.ServiceRequestContext in push at line 234
com.linecorp.armeria.server.grpc.ArmeriaRequestCoroutineContext in updateThreadContext at line 46
com.linecorp.armeria.server.grpc.ArmeriaRequestCoroutineContext in updateThreadContext at line 30
kotlinx.coroutines.internal.ThreadContextKt$updateState$1 in invoke at line 54
kotlinx.coroutines.internal.ThreadContextKt$updateState$1 in invoke at line 52
kotlin.coroutines.CoroutineContext$Element$DefaultImpls in fold at line 70
kotlin.coroutines.AbstractCoroutineContextElement in fold at line 15
kotlin.coroutines.CombinedContext in fold at line 131
2
kotlinx.coroutines.internal.ThreadContextKt in updateThreadContext at line 72
kotlinx.coroutines.DispatchedTask in run at line 224
java.util.concurrent.Executors$RunnableAdapter in call
java.util.concurrent.FutureTask in run
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask in run
java.util.concurrent.ThreadPoolExecutor in runWorker
java.util.concurrent.ThreadPoolExecutor$Worker in run
io.netty.util.concurrent.FastThreadLocalRunnable in run at line 30
java.lang.Thread in run
@minwoox
Copy link
Contributor

minwoox commented Nov 24, 2021

was "An error occurred while pushing a context":

The exception is raised when a thread tries to push a ReqeustContext to its thread-local while it has another RequestContext in its thread-local.
The RequestContext in thread-local should be popped out before another RequestContext is pushed but I guess it's not.
Could you provide a simple reproducible example so I can find out the cause, please? 🤔

@smax48
Copy link
Author

smax48 commented Nov 24, 2021

I wish I could, but I have no idea how that happened. The service was running in production for a while before this :(.
We have multiple grpc endpoints for this service. After this issue happened, the only solution was to restart that crashing instance ... which is very bad.

I can only guess that there's a race condition somewhere, maybe related with operation/coroutines cancellation - some requests running at the same time could be timing out - not sure if that caused something wrong.

@minwoox
Copy link
Contributor

minwoox commented Nov 24, 2021

Then, I probably need to take a look at your code. Let's have a private chat in Slack. Could you send me DM in Slack?

@smax48
Copy link
Author

smax48 commented Nov 24, 2021

sure!

@trustin
Copy link
Member

trustin commented Mar 28, 2022

@smax48 @minwoox Was the problem solved after all?

@onsah
Copy link

onsah commented May 10, 2022

I am also encountering a similar error recently. How can I solve?

@ikhoon
Copy link
Contributor

ikhoon commented May 12, 2022

@onsah Could you provide a reproducible example?

@ikhoon
Copy link
Contributor

ikhoon commented May 13, 2022

If it is difficult to reproduce, #4232 feature will help us find a clue to the error.

@kwazii1231
Copy link

kwazii1231 commented Oct 6, 2022

I also encountering this issue
and i can reproduce this.

i have a kind of spring service with suspend function that looks like this

suspend fun createGoods(input: Input) = withContext(Dispatchers.IO){

val aAsync = async { insert to db }
val bAsync = async { insert to db }

aAsync.await()
bAsync.await()

return@withContext Something()
}

and i call this from Grpc api many time at once.

sometimes the same error happend. but sometimes not

i think when server get too much request, this happens

i just saw related commit for detect context leak and i will try that!

@ikhoon
Copy link
Contributor

ikhoon commented Oct 6, 2022

Thanks for the reporting! I have been looking forward to more information on the bug.

I’m not an expert on the internal of Kotlin coroutine. I need to debug the code to find out the cause.
So it would be more helpful if you can provide a runnable example. 🙇‍♂️

@kwazii1231
Copy link

kwazii1231 commented Oct 9, 2022

finally i reproduce this.
kotlin + grpc + withContext(Dispatcher.IO) + too much requeset or some blocking action (like delay)
i try to figure out what is the main problem. but little bit confused. (i'm not expert of armeria and some interface is hard to understand). so i want to doing some private chat with who has good knowledge of ArmeriaCoroutineContextInterceptor , ArmeriaRequestCoroutineContext and NioEventLoop of Netty

if anyone not available i want to explain how to reproduce. (ilttle bit hard to do in english, i'm not good to explain)
please let me know how to handle this issue! thanks

@ikhoon
Copy link
Contributor

ikhoon commented Oct 9, 2022

What langugage do you prefer? We have Korean, Japanese and English channels in the Slack. https://armeria.dev/s/slack
If you want to have a private chat, you can find the same ID with my GitHub ID in the slack workspace. I’m not sure I can speak your native language though. 😅

@kwazii1231
Copy link

What langugage do you prefer? We have Korean, Japanese and English channels in the Slack. https://armeria.dev/s/slack If you want to have a private chat, you can find the same ID with my GitHub ID in the slack workspace. I’m not sure I can speak your native language though. 😅

i'm korean!!

@ikhoon
Copy link
Contributor

ikhoon commented Oct 9, 2022

I see. Let’s continute to have conversation in Korean at the Slack.
I will summarize some non-private information that would be good to share with other users.

@kwazii1231
Copy link

kwazii1231 commented Oct 9, 2022

after trying to reproduce this issue with @ikhoon
i found very mandatory condition of this.
if project use coroutine version below, this issue can be reproduce
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core-jvm:1.5.x")

in my case, my project use spring version 2.6.3 and gradle plugin for managing dependency
plugin automatically downloaded "org.jetbrains.kotlinx:kotlinx-coroutines-core-jvm:1.5.2"
after forced change version to 1.6.4 or upgrade spring boot version to 2.7.x
it didn't happen anymore.

kotlin 1.6.1 release note contain fix of this issue

Rollback of time-related functions dispatching on Dispatchers.Main.
This behavior was introduced in 1.6.0 and then found inconvenient and erroneous (Kotlin/kotlinx.coroutines#3106, Kotlin/kotlinx.coroutines#3113).
Reworked the newly-introduced CopyableThreadContextElement to solve issues uncovered after the initial release (Kotlin/kotlinx.coroutines#3227).
Fixed a bug with ThreadLocalElement not being properly updated in racy scenarios (Kotlin/kotlinx.coroutines#2930).
Reverted eager loading of default CoroutineExceptionHandler that triggered ANR on some devices (Kotlin/kotlinx.coroutines#3180).
New API to convert a CoroutineDispatcher to a Rx scheduler (Kotlin/kotlinx.coroutines#968, Kotlin/kotlinx.coroutines#548). Thanks @recheej!
Fixed a memory leak with the very last element emitted from flow builder being retained in memory (Kotlin/kotlinx.coroutines#3197).
Fixed a bug with limitedParallelism on K/N with new memory model throwing ClassCastException (Kotlin/kotlinx.coroutines#3223).
CoroutineContext is added to the exception printed to the default CoroutineExceptionHandler to improve debuggability (Kotlin/kotlinx.coroutines#3153).
Static memory consumption of Dispatchers.Default was significantly reduced (Kotlin/kotlinx.coroutines#3137).
Updated slf4j version in kotlinx-coroutines-slf4j from 1.7.25 to 1.7.32.

maybe leak is related with one of above issues
i think this one is related Kotlin/kotlinx.coroutines#2930

@ikhoon
Copy link
Contributor

ikhoon commented Oct 12, 2022

Thanks, @kwazii1231 for the thorough analysis.
I agree that Kotlin/kotlinx.coroutines#2930 caused the problem and it is fixed by https://github.com/Kotlin/kotlinx.coroutines/releases/tag/1.6.2.

@ikhoon ikhoon closed this as completed Oct 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants