Thread local is not cleaned up sometimes #2930
Comments
It's hard to tell what exactly is wrong without seeing the whole coroutine hierarchy. What I suspect may be the root cause is a third-party implementation of a coroutine builder that does not implement […]. #985 uses stack-walking capabilities, leveraging […]. When the exception is thrown, can you please check whether the coroutine context in the most nested coroutine contains […]? |
Thanks for the response. I'll check that. |
I've checked presence of the |
Hello. We are getting something like this after a few days in production. We have a loop like this:
Also, all builders are pretty standard. Maybe it involves some tricks with cancellation, exceptions, etc. We are on version 1.5.1. And of course, we have a lot of code like this:
|
A workaround like this:
fixed the issue in our case. |
Still reproducible on 1.6.0. |
Finally I've managed to write a small reproducer:

import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Semaphore
import kotlin.coroutines.resume

val threadLocal = ThreadLocal<String>()

suspend fun main() {
    while (true) {
        coroutineScope {
            repeat(100) {
                launch {
                    doSomeJob()
                }
            }
        }
    }
}

private suspend fun doSomeJob() {
    // Fails if an earlier coroutine left its value on this thread.
    check(threadLocal.get() == null)
    withContext(threadLocal.asContextElement("foo")) {
        val semaphore = Semaphore(1, 1) // no free permits, so acquire() suspends
        suspendCancellableCoroutine<Unit> { cont ->
            // Resume the coroutine from another thread.
            Dispatchers.Default.asExecutor().execute {
                cont.resume(Unit)
            }
        }
        cancel()
        semaphore.acquire()
    }
}

It completes almost instantly on my machine and takes some time on play.kotlinlang.org. |
Great job with a reproducer! Verified it reproduces, we'll fix it in 1.6.1 |
… in order to avoid state interference when the coroutine is updated concurrently. Concurrency is inevitable in this scenario: when the coroutine that has UndispatchedCoroutine as its completion suspends, we have to clear the thread context, but while we are doing so, concurrent resume of the coroutine could've happened that also ends up in save/clear/update context Fixes #2930
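To illustrate the kind of state interference this commit message describes, here is a minimal sketch that uses only plain `ThreadLocal` and two single-thread executors. It is not the library's code: `SavedState`, `SharedSlot`, `ConfinedSlot`, and `valueOnThreadAAfterRestore` are hypothetical names, and the interleaving is forced sequentially instead of being a real race.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

val userLocal = ThreadLocal<String>()

// Hypothetical stand-in for the place where the "old" thread-local value is kept
// between installing a new value and restoring the previous one.
interface SavedState {
    fun save(old: String?)
    fun take(): String?
}

// One field shared by every thread: a later save overwrites an earlier one.
class SharedSlot : SavedState {
    @Volatile private var saved: String? = null
    override fun save(old: String?) { saved = old }
    override fun take(): String? = saved
}

// One slot per thread: each thread only ever reads what it saved itself.
class ConfinedSlot : SavedState {
    private val slot = ThreadLocal<String?>()
    override fun save(old: String?) { slot.set(old) }
    override fun take(): String? = slot.get().also { slot.remove() }
}

// Returns what thread A holds in userLocal after restoring, when thread B's
// save happens between A's save and A's restore.
fun valueOnThreadAAfterRestore(state: SavedState): String? {
    val a = Executors.newSingleThreadExecutor()
    val b = Executors.newSingleThreadExecutor()
    try {
        // Thread A starts clean (null): save that, then install "foo".
        a.submit(Runnable {
            state.save(userLocal.get())
            userLocal.set("foo")
        }).get()
        // Thread B already holds "stale": save that, then install "foo".
        b.submit(Runnable {
            userLocal.set("stale")
            state.save(userLocal.get())
            userLocal.set("foo")
        }).get()
        // Thread A restores from the saved state and reports what it now holds.
        return a.submit(Callable<String?> {
            val old = state.take()
            if (old == null) userLocal.remove() else userLocal.set(old)
            userLocal.get()
        }).get()
    } finally {
        a.shutdown()
        b.shutdown()
    }
}

fun main() {
    println(valueOnThreadAAfterRestore(SharedSlot()))   // prints "stale"
    println(valueOnThreadAAfterRestore(ConfinedSlot())) // prints "null"
}
```

With the shared slot, thread A restores the value that thread B saved, so a foreign value ends up on A. With the thread-confined slot, each thread sees only its own saved state, which is the general idea behind confining the state to the thread.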
Is there a planned release date for 1.6.1? |
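Unfortunately, the issue seems to be still there. The following code throws an exception on versions from 1.5.0 up to the current develop branch (262876b):

import kotlinx.coroutines.*
import kotlinx.coroutines.future.await
import kotlinx.coroutines.sync.Semaphore
import java.util.concurrent.CompletableFuture

val threadLocal = ThreadLocal<String>()

suspend fun main() {
    doSomeJob()
    doSomeJob()
}

private suspend fun doSomeJob() {
    // Fails on the second call if the first call left "foo" on this thread.
    check(threadLocal.get() == null)
    withContext(threadLocal.asContextElement("foo")) {
        try {
            coroutineScope {
                val semaphore = Semaphore(1, 1) // no free permits, so acquire() suspends
                dummyAwait()
                cancel()
                semaphore.acquire()
            }
        } catch (e: CancellationException) {
            println("cancelled")
        }
    }
}

// Suspends and resumes on a Dispatchers.Default thread.
private suspend fun dummyAwait() {
    CompletableFuture.runAsync({ }, Dispatchers.Default.asExecutor()).await()
}
|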
Could you please recheck on 1.6.1? I cannot reproduce it as is, I will give a few tries a bit later to see if it still reproduces. Anyway, 1.6.1 fixes at least one serious bug in thread locals, so it's worth upgrading |
The same on 1.6.1, every time. Checked on Liberica JDK 11.0.14 and some build of OpenJDK 17. |
Aha, I see, it only reproduces with non-[…]. I'll fix it separately. Meanwhile, it would be nice to see if you are still affected in the production environment, as it's unlikely that someone has […]. Depending on that, we'll decide on the urgency of the fix. |
Indeed, […]. The potential production case is an application based on Ktor 1.6.8. When using […]. We have some hooks in the infrastructure to ensure that code runs with an initial value of […] |
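The reproducer for Ktor does not differ much:

// Ktor package names as of 1.6.x, the version mentioned above.
import io.ktor.routing.*
import io.ktor.server.engine.*
import io.ktor.server.netty.*
import kotlinx.coroutines.*
import kotlinx.coroutines.future.await
import kotlinx.coroutines.sync.Semaphore
import java.util.concurrent.CompletableFuture

val threadLocal = ThreadLocal<String>()

fun main() {
    val engine = embeddedServer(Netty, port = 8080) {
        routing {
            get {
                doSomeJob()
                doSomeJob()
            }
        }
    }
    engine.start()
}

private suspend fun doSomeJob() {
    // Fails on the second call if the first call left "foo" on this thread.
    check(threadLocal.get() == null)
    withContext(threadLocal.asContextElement("foo")) {
        try {
            coroutineScope {
                val semaphore = Semaphore(1, 1) // no free permits, so acquire() suspends
                dummyAwait()
                cancel()
                semaphore.acquire()
            }
        } catch (e: CancellationException) {
            println("cancelled")
        }
    }
}

// Suspends and resumes on a Dispatchers.Default thread.
private suspend fun dummyAwait() {
    CompletableFuture.runAsync({ }, Dispatchers.Default.asExecutor()).await()
}
|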
Thanks for both the Ktor and the regular reproducer! The source of the issue is indeed non-[…]. I have a potential solution in mind (#3252) and also a future-proof plan to avoid similar problems (#3253). I believe this issue itself is enough to release 1.6.2 with a fix, though I cannot give you a strict timeline here. |
…ercepted with DispatchedContinuation Fixes #2930
Kotlin#3155) * Confine context-specific state to the thread in UndispatchedCoroutine in order to avoid state interference when the coroutine is updated concurrently. Concurrency is inevitable in this scenario: when the coroutine that has UndispatchedCoroutine as its completion suspends, we have to clear the thread context, but while we are doing so, concurrent resume of the coroutine could've happened that also ends up in save/clear/update context Fixes Kotlin#2930
…ercepted with DispatchedContinuation (Kotlin#3252) * Properly preserve thread local values for coroutines that are not intercepted with DispatchedContinuation Fixes Kotlin#2930
Just want to notify: Ktor users are still affected by the issue even on the latest 1.6.4 version ( |
@michail-nikolaev could you please share a reproducer? |
@qwwdfsad I have updated the repo with new versions and a new test based on the one Aleksei Tirman provided: https://github.com/michail-nikolaev/kotlin-coroutines-thread-local/blob/master/test/SuspendFunctionGunTest.kt (and a copy-paste of the Ktor code in https://github.com/michail-nikolaev/kotlin-coroutines-thread-local/blob/master/test/CopyPast.kt) |
I'll investigate it, thanks |
Preliminary findings are that the problem is caused by |
Here is the patch to make the reproducer from https://github.com/michail-nikolaev/kotlin-coroutines-thread-local/ work (NB: the reproducer is non-deterministic and happens to break only under a debugger due to its timing-sensitive nature)
We discussed it internally and figured out that this is a Ktor-side problem (https://youtrack.jetbrains.com/issue/KTOR-2644/) that will be taken care of by the Ktor team. Closing as a third-party problem. |
Hi! Here I am again. The following code fails on 1.6.4:
|
Thanks! More deterministic repro:
|
I'm working on that, the ETA is 1.7.0-Beta|RC. The fix is unfortunately far from trivial and basically boils down to #3253 |
A small update: there are two bugs: one in […]. The bug in Ktor is fixed in […] |
Just want to inform - we still see the issue in production with KTOR 2.2.2. But |
@qwwdfsad Could you please provide some insight about the state of the issue and plans? We have to avoid coroutines in our server-side code because of it. |
Thanks for the reminder, no updates yet. We'll see if it's manageable in the scope of 1.8.0 |
Any updates on this issue? We are using version 1.9.0 of the coroutines library, and it looks like the Ktor issues related to these bugs are still in place. It has been quite a long time since the issue was raised. |
Looks like Ktor 3.0.0 resolves this issue - at least my test doesn't fail anymore. |
I'm using a ThreadContextElement that sets the value of a ThreadLocal. After #985 was resolved, it worked perfectly. But after upgrading to 1.5.0 I've got a similar problem: sometimes the last value of the thread local gets stuck in a worker thread.
Equivalent code:
The actual code of the ThreadContextElement implementation is here.
It is hard to reproduce the issue, but I'm facing it periodically in production (it may take hours or days to arise).
Tested 1.5.0 and 1.5.2; both behave the same. Running it with -ea.
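The symptom described here, a value lingering on a pooled worker thread, can be shown without coroutines at all. The following is a minimal sketch using only the JDK; `requestLocal` and `valueSeenByNextTask` are hypothetical names chosen for illustration, and the single-thread executor stands in for a dispatcher's worker thread.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

val requestLocal = ThreadLocal<String>()

// Runs `first` and then reads requestLocal, both on the same pooled worker thread.
fun valueSeenByNextTask(first: () -> Unit): String? {
    val worker = Executors.newSingleThreadExecutor()
    try {
        worker.submit(Runnable { first() }).get()
        return worker.submit(Callable<String?> { requestLocal.get() }).get()
    } finally {
        worker.shutdown()
    }
}

fun main() {
    // The value is set and never removed: the next task on the thread still sees it.
    println(valueSeenByNextTask { requestLocal.set("foo") }) // prints "foo"

    // The value is removed in finally: the next task finds a clean thread.
    println(valueSeenByNextTask {
        requestLocal.set("foo")
        try {
            // work that relies on the thread-local value would go here
        } finally {
            requestLocal.remove()
        }
    }) // prints "null"
}
```

A context element is expected to perform the equivalent of the `finally` branch every time the coroutine leaves the thread; the reports in this thread are cases where that restore step was skipped.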