TSAN performance degradation in LLVM 14 #1552
Grepping out all threads blocked in application code, I got this: there are 2 cases with `__tsan::RestoreStack` and lots of `__sanitizer::FutexWait` (what you described), plus the first case with lots of `__sanitizer::internal_sched_yield`, which looks like DenseAllocator contention. For the rest I don't see anything suspicious / can't explain it. For now I would assume that the contention had simply cleared up by the time the stack traces were taken.
Having 700 threads should not be a problem on its own. We apply tsan to programs with tens of thousands of threads.
It may be an artifact of the new tsan algorithm. This should not happen frequently, but infrequent cases are expected.
The new algorithm is significantly different. There are lots of trade-offs involved, and the exact difference in precision is very complex. I hope to give a public talk/video about the exact differences soon.
Thank you for the quick response!
I'm not familiar with this code, but it seems quite unrelated to the problem, because all these calls were found in stack traces collected at 2022-07-14 02:28:27. Also, as far as I understand the server logs, it slowed down later, at about 2022-07-14 02:40.
Within these 17 minutes, 3 stack trace snapshots were taken. Each one stops the whole server for about 90 seconds, so 12.5 minutes would be a better estimate, but it's still extremely long. On average this test (not only this query, but the whole test that contains it) takes 2.3 minutes with thread sanitizer. Here you can see statistics for runs of this test in ClickHouse master: Also due to
Right, there are no reports from the sanitizer. All the information we have about this test run in CI can be found here: https://s3.amazonaws.com/clickhouse-test-reports/39194/45dfaa3491ff43774463449a3060df631479235c/stateless_tests__thread__actions__%5B1/3%5D.html
Is there anything I can do to provide more information about this situation? Would it be useful if I built clang with `TSAN_DEBUG_OUTPUT`?
I see. Does it happen in every run of this test now? Or just in that one run? How reproducible is it?
TSAN_DEBUG_OUTPUT mode logs 1 line per memory access in the program, so if it runs for minutes, it will be zettabytes of logs... Any chance I can reproduce it locally? That would be the fastest way to debug.
I've sent https://reviews.llvm.org/D130002 for the `__sanitizer::internal_sched_yield` issue. Even if it's not the main one, it's still an issue, and it's easily reproducible on a synthetic benchmark. So there's no reason not to fix it.
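For context, here is a minimal sketch of the kind of synthetic benchmark that makes yield-based spinning visible. This is only an illustration of the contention pattern behind the `internal_sched_yield` frames, not the actual benchmark from D130002 or the TSAN runtime code; the thread and iteration counts are made up.

```cpp
// Illustration only: a yield-based spin lock under heavy contention.
// When many threads fight over one tiny critical section, most CPU time
// goes into yielding and rescheduling rather than useful work.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<bool> locked{false};
long counter = 0;  // protected by the spin lock below

void Lock() {
  // Slow path: spin and yield (std::this_thread::yield boils down to
  // sched_yield on Linux, matching the frames in the stack dump).
  while (locked.exchange(true, std::memory_order_acquire))
    std::this_thread::yield();
}

void Unlock() { locked.store(false, std::memory_order_release); }

int main() {
  const int kThreads = 64;
  const int kIters = 100000;
  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int i = 0; i < kThreads; i++) {
    threads.emplace_back([&] {
      for (int j = 0; j < kIters; j++) {
        Lock();
        counter++;  // critical section is tiny; contention dominates
        Unlock();
      }
    });
  }
  for (auto &t : threads) t.join();
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - start)
                .count();
  std::printf("%d threads, counter=%ld, %lld ms\n", kThreads, counter,
              static_cast<long long>(ms));
}
```

The slowdown grows with the thread count because yielding waiters keep getting rescheduled only to find the lock still held.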
It happens from time to time; we see it once every couple of days. Here are the statistics of such test runs in our CI since the migration: We had a broken master for several days, so I'd say it has happened ~40-50 times since April.
It seems to me that it's hard to reproduce locally for now. We are going to try to reproduce it on our side (we suspect it may be related to aggregation/many memory accesses to nearby memory addresses). Also, I want to try to attach perf to the server process between the stack trace collections; maybe it will help us understand what happens during this period of time.
After removing the check_duration_ms constraint I got 16411, so it seems that it's failing 1 time out of ~16411/66 ≈ 250.
Sorry for the poor description. The server slows down at random moments, so it affects random tests; that's why I counted the number of test runs during which this problem happened. It means it happens once per 250-300 commits in our repository (on average). It takes about 45 minutes on average to run all tests with TSAN: 45 * 250 / 60 = 187.5 hours, or 7-8 days.
7-8 days explains why I couldn't trigger it :) I've tried to create absolutely worst-case conditions for what I think happens: it indeed triggers spurious ReportRace, but the most I can get is tens of milliseconds of delay in ReportRace:
More stack traces from when such a stall happens may shed some light. But otherwise I am out of ideas.
I think I managed to reproduce something similar. It involves a producer thread and lots of consumer threads that read the same data. I've left my debugging there for future reference:
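For readers who can't follow the link, here is a hedged sketch of the access pattern described above (one producer publishing data, many consumer threads repeatedly reading it). The sizes, thread count, and names are my own placeholders, not the actual reproducer from the debugging notes.

```cpp
// Sketch of the reproducer shape: one producer fills a shared buffer,
// many consumer threads then repeatedly read the same data.
// Build with -fsanitize=thread to exercise TSAN.
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

int main() {
  constexpr std::size_t kSize = 1 << 20;
  constexpr int kConsumers = 128;
  std::vector<int> data(kSize);
  std::vector<long> sums(kConsumers);  // one slot per consumer, no sharing
  std::atomic<bool> ready{false};
  std::atomic<bool> stop{false};

  std::thread producer([&] {
    for (std::size_t i = 0; i < kSize; i++) data[i] = static_cast<int>(i);
    ready.store(true, std::memory_order_release);  // publish the data
  });

  std::vector<std::thread> consumers;
  for (int c = 0; c < kConsumers; c++) {
    consumers.emplace_back([&, c] {
      while (!ready.load(std::memory_order_acquire)) {}  // wait for producer
      long sum = 0;
      while (!stop.load(std::memory_order_relaxed))
        for (std::size_t i = 0; i < kSize; i++) sum += data[i];  // read-only scans
      sums[c] = sum;
    });
  }

  producer.join();
  std::this_thread::sleep_for(std::chrono::seconds(10));
  stop.store(true, std::memory_order_relaxed);
  for (auto &t : consumers) t.join();
}
```

The point of the pattern is that a huge number of reader threads repeatedly touch the same addresses, so a single stale shadow value can be observed by many threads in a short window.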
Prevent the following pathological behavior: since memory access handling is not synchronized with DoReset, a thread running concurrently with DoReset can leave a bogus shadow value that will later be falsely detected as a race. For such false races RestoreStack will return false and we will not report it. However, consider that a thread leaves a whole lot of such bogus values and these values are later read by a whole lot of threads. This will cause massive amounts of ReportRace calls and lots of serialization. In very pathological cases the resulting slowdown can be >100x. This is very unlikely, but it was presumably observed in practice: google/sanitizers#1552

If this happens, the previous access sid+epoch will be the same for all of these false races, because if the thread tries to increment the epoch, it will notice that DoReset has happened and will stop producing bogus shadow values. So last_spurious_race is used to remember the last sid+epoch for which RestoreStack returned false. It is then used to filter out races with the same sid+epoch very early and quickly.

It is of course possible that multiple threads left multiple bogus shadow values and all of them are read by lots of threads at the same time. In such a case last_spurious_race will only be able to deduplicate a few races from one thread, then a few from another, and so on. An alternative would be to hold an array of such sid+epoch pairs, but we consider such a scenario even less likely.

Note: this can lead to some rare false negatives as well:
1. When a legit access with the same sid+epoch participates in a race as the "previous" memory access, it will be wrongly filtered out.
2. When RestoreStack returns false for a legit memory access because it was already evicted from the thread trace, we will still remember it in last_spurious_race. Then, if there is another racing memory access from the same thread that happened in the same epoch but was stored in the next thread trace part (which is still preserved in the thread trace), we will also wrongly filter it out, while RestoreStack would actually succeed for that second memory access.

Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D130269
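To make the mechanism in the commit message easier to follow, here is a simplified pseudo-C++ sketch of the `last_spurious_race` filter. Types and function names are stand-ins; the real runtime code differs in detail and also runs under the report lock.

```cpp
// Simplified sketch of the last_spurious_race idea from D130269.
// All types and functions are illustrative stand-ins, not the real
// TSAN runtime code.
#include <cstdint>

struct SidEpoch {
  std::uint8_t sid;     // slot id (Sid) of the previous access
  std::uint16_t epoch;  // epoch of the previous access
  bool operator==(const SidEpoch &o) const {
    return sid == o.sid && epoch == o.epoch;
  }
};

// Last sid+epoch for which stack restoration failed -- most likely a
// bogus shadow value left behind by a thread racing with DoReset.
static SidEpoch last_spurious_race{0, 0};

// Stand-in for RestoreStack(): pretend the previous access cannot be
// found in the thread trace (a spurious or evicted access).
static bool RestoreStack(const SidEpoch & /*prev*/) { return false; }

void ReportRace(const SidEpoch &prev) {
  // Early, cheap filter: if we already failed to restore a stack for this
  // exact sid+epoch, this is almost certainly another read of the same
  // batch of bogus shadow values, so skip the serialized reporting path.
  if (prev == last_spurious_race)
    return;

  if (!RestoreStack(prev)) {
    // Remember this sid+epoch so that subsequent reads of the same bogus
    // values do not all funnel through the expensive path.
    last_spurious_race = prev;
    return;
  }

  // ... build and print the real race report ...
}

int main() {
  // Many candidate races carrying the same previous sid+epoch: only the
  // first one pays the RestoreStack cost; the rest are filtered out early.
  SidEpoch prev{3, 42};
  for (int i = 0; i < 1000; i++)
    ReportRace(prev);
}
```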
The tentative fix has landed. Please re-open if it still happens with the fix.
Hi! After migrating from clang-13 to clang-14 we noticed that our tests with thread sanitizer run faster on average, but sometimes a run is extremely slow. When it happens, we see in the logs that the ClickHouse server hangs for about 10 minutes and then continues to work as if nothing happened.
To understand what happens at those moments, I added a separate process that attaches gdb to the server and collects stack traces of all threads every 210 seconds. Here are the collected stack traces for one such run:
https://s3.amazonaws.com/clickhouse-test-reports/39194/45dfaa3491ff43774463449a3060df631479235c/stateless_tests__thread__actions__[1/3]/tsan_traces.txt
There we can see a lot of threads with stacks like this:
As far as I understand, the difference between the implementations of `__tsan::TraceSwitchPart` in clang-13 and clang-14 is here:
https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp#L910
and here:
https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp#L958-L979
Both are related to thread slots (aka `TidSlot`) and their synchronization. There are only 255 slots available now. Unfortunately, we have more than 255 threads running in the ClickHouse server (as you can see in the stack traces I collected). I suppose threads may get stuck waiting there for several minutes (maybe the algorithm is unfair?).
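To make the concern concrete, here is a toy model of a bounded slot pool, assuming a fixed pool of 255 slots and ~700 threads as in the stack traces. This is my own simplification for illustration, not the real `TidSlot` code in tsan_rtl.cpp, which has its own queueing and trace-switch logic.

```cpp
// Toy model only: more runnable threads than slots means some threads
// must queue for a free slot; how long they wait depends entirely on the
// fairness of the wakeup policy.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kSlots = 255;    // number of slots, as in current TSAN
constexpr int kThreads = 700;  // roughly the thread count in the server

class SlotPool {
  std::mutex mu_;
  std::condition_variable cv_;
  int free_ = kSlots;

 public:
  void Acquire() {
    std::unique_lock<std::mutex> lock(mu_);
    // Threads beyond the slot count block here; notify_one below does not
    // guarantee FIFO order, so an unlucky thread can wait a long time.
    cv_.wait(lock, [&] { return free_ > 0; });
    --free_;
  }
  void Release() {
    std::lock_guard<std::mutex> lock(mu_);
    ++free_;
    cv_.notify_one();
  }
};

int main() {
  SlotPool pool;
  std::vector<std::thread> threads;
  for (int i = 0; i < kThreads; i++) {
    threads.emplace_back([&pool] {
      for (int j = 0; j < 1000; j++) {
        pool.Acquire();             // stands in for attaching to a slot
        std::this_thread::yield();  // "work" done while holding the slot
        pool.Release();             // stands in for a trace-part switch
      }
    });
  }
  for (auto &t : threads) t.join();
  std::puts("done");
}
```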
Also, I see another suspicious stack trace that leads to a `RestoreStack` call:

I'm sure that this is the only thread that writes anything to the provided `DB::PODArray`. All other function arguments are supposed to be immutable, so it's unclear to me why the short path didn't work here. I'm not sure, but it may be related to this line:
https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/tsan/rtl/tsan_rtl_access.cpp#L217
Before LLVM 14, `Tid` was used to check whether two memory accesses happened in separate threads, but now `Sid` is used. This check seems less precise than the previous one, because the corresponding `Sid` for a thread may change during program execution.
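To spell out the concern, here is a simplified illustration with stand-in field names, not the real shadow encoding or the exact check at the linked line: a `Tid` stays fixed for a thread's lifetime, while a `Sid` only identifies the slot the thread currently occupies, so a thread that changes slots between two accesses would no longer be recognized as "the same thread" by a sid-based comparison.

```cpp
// Simplified illustration of the same-thread fast-path concern.
// Field and function names are stand-ins, not the real TSAN code.
#include <cassert>
#include <cstdint>

struct OldShadow { std::uint32_t tid; };  // pre-LLVM-14 style: thread id
struct NewShadow { std::uint8_t sid; };   // LLVM-14 style: slot id

// Old check: Tid is fixed for the lifetime of a thread, so this reliably
// recognizes "the previous access came from the current thread".
bool SameThreadOld(const OldShadow &prev, std::uint32_t cur_tid) {
  return prev.tid == cur_tid;
}

// New check: Sid identifies the slot the thread currently occupies. If
// the thread has moved to another slot between the two accesses, the
// comparison fails even though both accesses came from the same thread.
bool SameThreadNew(const NewShadow &prev, std::uint8_t cur_sid) {
  return prev.sid == cur_sid;
}

int main() {
  OldShadow prev_old{7};
  NewShadow prev_new{1};
  assert(SameThreadOld(prev_old, 7));   // same thread -> fast path taken
  assert(!SameThreadNew(prev_new, 2));  // same thread, but now in slot 2
                                        // -> fast path missed
  return 0;
}
```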