TSAN performance degradation in LLVM 14 #1552

Closed
novikd opened this issue Jul 16, 2022 · 16 comments

novikd commented Jul 16, 2022

Hi! After migrating from clang-13 to clang-14, we noticed that our tests with ThreadSanitizer run faster on average, but sometimes they run extremely slowly. When that happens, we see in the logs that the ClickHouse server hangs for about 10 minutes and then continues to work as if nothing had happened.

To understand what happens at those moments, I wrote code that adds a separate process which attaches gdb to the server and collects stack traces of all threads every 210 seconds. Here are the stack traces collected for one such run:
https://s3.amazonaws.com/clickhouse-test-reports/39194/45dfaa3491ff43774463449a3060df631479235c/stateless_tests__thread__actions__[1/3]/tsan_traces.txt
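
For reference, the collector boils down to roughly the following (a simplified sketch, not the exact code we use; the pid argument and output file name are illustrative):

// Periodically attach gdb in batch mode and dump backtraces of all threads.
// Simplified sketch: no error handling, and each attach pauses the server
// for as long as the dump takes.
#include <chrono>
#include <cstdlib>
#include <string>
#include <thread>

int main(int argc, char** argv) {
    const std::string pid = argc > 1 ? argv[1] : "0";   // server pid passed on the command line
    const auto interval = std::chrono::seconds(210);    // collection period
    while (true) {
        const std::string cmd =
            "gdb -batch -p " + pid +
            " -ex 'set pagination off'"
            " -ex 'thread apply all backtrace'"
            " >> tsan_traces.txt 2>&1";                 // append each snapshot to one file
        std::system(cmd.c_str());
        std::this_thread::sleep_for(interval);
    }
}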

There we can see a lot of threads with stacks like this:

#0  0x000000000b4093ea in __sanitizer::FutexWait(__sanitizer::atomic_uint32_t*, unsigned int) ()
#1  0x000000000b40a48a in __sanitizer::Semaphore::Wait() ()
#2  0x000000000b490951 in __tsan::TraceSwitchPartImpl(__tsan::ThreadState*) ()

As far as I understand, the difference between the implementations of __tsan::TraceSwitchPart in clang-13 and clang-14 is here:
https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp#L910
and
https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp#L958-L979
Both are related to thread slots (aka TidSlot) and their synchronization. There are only 255 slots available now. Unfortunately, we have more than 255 threads running in the ClickHouse server (as you can see in the stack traces I collected). I suppose threads may get stuck waiting there for several minutes (maybe the algorithm is unfair?).
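
To make the concern concrete, here is a purely conceptual sketch of the pattern I have in mind: a fixed pool of slots guarded by a counting semaphore, with more runnable threads than slots. This is only an illustration of the contention hypothesis, not the actual TSan code.

// Conceptual illustration only (C++20): threads competing for a fixed pool
// of slots. When all slots are taken, acquire() blocks until some other
// thread releases one; which waiter wakes up next is unspecified.
#include <semaphore>
#include <thread>
#include <vector>

constexpr int kSlotCount = 255;                        // mirrors the slot limit discussed above
std::counting_semaphore<kSlotCount> slots{kSlotCount};

void do_traced_work() {
    slots.acquire();                                   // may wait a long time under heavy contention
    // ... work that requires holding a slot ...
    slots.release();                                   // hand the slot to some waiting thread
}

int main() {
    std::vector<std::jthread> threads;
    for (int i = 0; i < 700; ++i)                      // more threads than slots, as in our server
        threads.emplace_back(do_traced_work);
}                                                      // jthreads join automatically here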

Also, I see another suspicious stack trace that leads to a RestoreStack call:

#0  0x000000000b4a1390 in __tsan::RestoreStack(__tsan::EventType, __tsan::Sid, __tsan::Epoch, unsigned long, unsigned long, unsigned long, unsigned int*, __tsan::VarSizeStackTrace*, __tsan::MutexSet*, unsigned long*) ()
#1  0x000000000b4a3190 in __tsan::ReportRace(__tsan::ThreadState*, __tsan::RawShadow*, __tsan::Shadow, __tsan::Shadow, unsigned long) ()
#2  0x000000001798f276 in void DB::packFixedBatch<wide::integer<128ul, unsigned int> >(unsigned long, std::__1::vector<DB::IColumn const*, std::__1::allocator<DB::IColumn const*> > const&, std::__1::vector<unsigned long, std::__1::allocator<unsigned long> > const&, DB::PODArray<wide::integer<128ul, unsigned int>, 4096ul, Allocator<false, false>, 15ul, 16ul>&) ()

I'm sure that this is the only thread that writes to the provided DB::PODArray. All the other function arguments are supposed to be immutable, so it's unclear to me why the short path didn't work here.

I'm not sure, but it may be related to this line:
https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/tsan/rtl/tsan_rtl_access.cpp#L217
Before LLVM 14, Tid was used to check whether two memory accesses happened in separate threads, but now Sid is used.
This check seems less precise than the previous one, because the Sid corresponding to a thread may change during program execution.
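
Purely to illustrate why this worries me (this is not the actual shadow layout or check in compiler-rt, just a hypothetical model), a same-thread fast path keyed by a small, reusable slot id looks like this:

// Hypothetical model, for illustration only: the shadow of a previous access
// records the slot id (Sid) of the thread that made it, and the fast path
// compares it with the current thread's slot id.
#include <cstdint>

struct ShadowCell {
    uint8_t  sid;    // slot id of the accessing thread (small, reusable)
    uint16_t epoch;  // logical time of the access
};

// If slot ids can be reassigned during execution, the same thread may appear
// under different sids over time (and different threads may reuse one sid),
// so this check is less precise than comparing stable thread ids (Tid).
bool same_thread_fast_path(const ShadowCell& prev, uint8_t cur_sid) {
    return prev.sid == cur_sid;
}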

dvyukov commented Jul 16, 2022

Grepping out all blocked application threads, I got this:
https://gist.githubusercontent.com/dvyukov/4bbd36a5c8f9650fb80a3fb40544d0a1/raw/1e30df7a7948893144ad50e30c9f77f509880c49/gistfile1.txt

There are 2 cases with __tsan::RestoreStack and lots of __sanitizer::FutexWait (what you described).

Plus the first case with lots of __sanitizer::internal_sched_yield. This looks like DenseAllocator contention.

For the rest, I don't see anything suspicious or that I can't explain. For now I would assume the contention simply cleared up by the time the stack traces were taken.

dvyukov commented Jul 16, 2022

Unfortunately, we have more than 255 threads running in the ClickHouse server (as you can see in the stack traces I collected).

Having 700 threads should not be a problem on its own; we apply TSan to programs with tens of thousands of threads.
There may be a problem if you have more than 256 actively running threads and more than 256 physical cores, but I did not look into what exactly happens in that case.
In these traces I see at most 80 actively running threads.

dvyukov commented Jul 16, 2022

I'm sure that this is the only thread that writes to the provided DB::PODArray. All the other function arguments are supposed to be immutable, so it's unclear to me why the short path didn't work here.

It may be an artifact of the new tsan algorithm. This should not happen frequently, but infrequent cases are expected.
It does not print any race reports, right?

dvyukov commented Jul 16, 2022

Before LLVM 14, Tid was used to check whether two memory accesses happened in separate threads, but now Sid is used.
This check seems less precise than the previous one, because the Sid corresponding to a thread may change during program execution.

The new algorithm is significantly different. There are lots of trade-offs involved, and the exact difference in precision is complex. I hope to give a public talk/video on the exact differences soon.

novikd commented Jul 16, 2022

Thank you for the quick response!

Plus the first case with lots of __sanitizer::internal_sched_yield. This looks like DenseAllocator contention.

I'm not familiar with this code, but it seems unrelated to the problem, because all these calls were found in the stack traces collected at 2022-07-14 02:28:27.
The next stack traces were collected at 2022-07-14 02:34:48, and there is no sign of __sanitizer::internal_sched_yield in them.

Also, as far as I understand the server logs, it slowed down later, at about 2022.07.14 02:40.
The following part of the log shows that one query took about 17 minutes to execute.

728550:2022.07.14 02:40:14.367797 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Debug> executeQuery: (from [::1]:42408) (comment: 00731_long_merge_tree_select_opened_files.sh) SELECT count() FROM (SELECT toDayOfWeek(date) AS m, id, count() FROM merge_tree_table GROUP BY id, m ORDER BY count() DESC LIMIT 10 SETTINGS max_threads = 1) (stage: Complete)
728551:2022.07.14 02:40:14.373112 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Trace> ContextAccess (default): Access granted: SELECT(id, date) ON test_gg5kva.merge_tree_table
728552:2022.07.14 02:40:14.377466 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Trace> ContextAccess (default): Access granted: SELECT(id, date) ON test_gg5kva.merge_tree_table
728553:2022.07.14 02:40:14.384926 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Trace> ContextAccess (default): Access granted: SELECT(id, date) ON test_gg5kva.merge_tree_table
728554:2022.07.14 02:40:14.387298 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Trace> ContextAccess (default): Access granted: SELECT(id, date) ON test_gg5kva.merge_tree_table
728558:2022.07.14 02:40:14.388200 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
728559:2022.07.14 02:40:14.388759 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
728566:2022.07.14 02:40:14.390070 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Debug> test_gg5kva.merge_tree_table (85b36e5f-6b14-4e3a-a46d-015e05df4be3) (SelectExecutor): Key condition: unknown
728569:2022.07.14 02:40:14.391385 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Debug> test_gg5kva.merge_tree_table (85b36e5f-6b14-4e3a-a46d-015e05df4be3) (SelectExecutor): MinMax index condition: unknown
728572:2022.07.14 02:40:14.391622 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Debug> test_gg5kva.merge_tree_table (85b36e5f-6b14-4e3a-a46d-015e05df4be3) (SelectExecutor): Selected 1/1 parts by partition key, 1 parts by primary key, 1221/1221 marks by primary key, 1221 marks to read from 1 ranges
728573:2022.07.14 02:40:14.391888 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Trace> MergeTreeInOrderSelectProcessor: Reading 5 ranges in order from part 20180801_20180801_1_10_2, approx. 10000000 rows starting from 0
728607:2022.07.14 02:40:14.425852 [ 3325 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Trace> AggregatingInOrderTransform: Aggregating in order
835393:2022.07.14 02:57:04.276190 [ 3325 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Debug> AggregatingInOrderTransform: Aggregated. 10000000 to 10000 rows (from 113.99 MiB)
835394:2022.07.14 02:57:04.278246 [ 3325 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Trace> AggregatingTransform: Aggregating
835395:2022.07.14 02:57:04.278344 [ 3325 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Trace> Aggregator: Aggregation method: without_key
835396:2022.07.14 02:57:04.278467 [ 3325 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Debug> AggregatingTransform: Aggregated. 10 to 1 rows (from 10.00 B) in 1009.885075258 sec. (0.010 rows/sec., 0.01 B/sec.)
835397:2022.07.14 02:57:04.278525 [ 3325 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Trace> Aggregator: Merging aggregated data
835398:2022.07.14 02:57:04.278614 [ 3325 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Trace> Aggregator: Statistics updated for key=14863528974122729875: new sum_of_sizes=1, median_size=1
835399:2022.07.14 02:57:04.280194 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Information> executeQuery: Read 10000000 rows, 95.37 MiB in 1009.911988377 sec., 9901 rows/sec., 96.70 KiB/sec.
835400:2022.07.14 02:57:04.282518 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Debug> MemoryTracker: Peak memory usage (for query): 5.86 MiB.
835401:2022.07.14 02:57:04.282594 [ 26064 ] {d4f79270-f32a-4dca-b1fb-8a6533cb43f3} <Debug> TCPHandler: Processed in 1009.920455701 sec.

Within these 17 minutes, 3 stack-trace snapshots were taken. Each one stops the whole server for about 90 seconds, so 12.5 minutes would be a better estimate, but either way it's extremely long. On average this test (not just this query, but the whole test that contains it) takes 2.3 minutes with ThreadSanitizer. Here you can see statistics for runs of this test on ClickHouse master:
https://play.clickhouse.com/play?user=play#U0VMRUNUIAogdG9TdGFydE9mRGF5KGNoZWNrX3N0YXJ0X3RpbWUpIEFTIGRheSwKIHJvdW5kKG1heCh0ZXN0X2R1cmF0aW9uX21zKSAvIDEwMDAgLyA2MCwgMSkgQVMgbWF4X2R1cl9taW51dGVzLAogcm91bmQoYXZnKHRlc3RfZHVyYXRpb25fbXMpIC8gMTAwMCAvIDYwLCAxKSBBUyBhdmdfZHVyX21pbnV0ZXMsCiByb3VuZChxdWFudGlsZSgwLjEpKHRlc3RfZHVyYXRpb25fbXMpIC8gMTAwMCAvIDYwLCAxKSBBUyBxMV9kdXJfbWludXRlcywKIHJvdW5kKHF1YW50aWxlKDAuNSkodGVzdF9kdXJhdGlvbl9tcykgLyAxMDAwIC8gNjAsIDEpIEFTIHE1X2R1cl9taW51dGVzLAogcm91bmQocXVhbnRpbGUoMC45KSh0ZXN0X2R1cmF0aW9uX21zKSAvIDEwMDAgLyA2MCwgMSkgQVMgcTlfZHVyX21pbnV0ZXMKRlJPTSBjaGVja3MKV0hFUkUgJzIwMjItMDQtMDEnIDw9IGNoZWNrX3N0YXJ0X3RpbWUgYW5kIHRlc3RfbmFtZT0nMDA3MzFfbG9uZ19tZXJnZV90cmVlX3NlbGVjdF9vcGVuZWRfZmlsZXMnIGFuZCBjaGVja19uYW1lIGlsaWtlICclc3RhdGVsZXNzJXRocmVhZCUnIGFuZCBjaGVja19kdXJhdGlvbl9tcyA+IDMwMDAwMCBhbmQgcHVsbF9yZXF1ZXN0X251bWJlcj0wCkdST1VQIEJZIGRheQpPUkRFUiBCWSBkYXkgZGVzYw==

Also, because of the SETTINGS max_threads = 1 part of the query, we can be sure that it was executed in a single thread. This means that the thread hung for 10+ minutes.

It may be an artifact of the new tsan algorithm. This should not happen frequently, but infrequent cases are expected.
It does not print any race reports, right?

Right, there are no reports from the sanitizer. All the information we have about this test run in CI can be found here: https://s3.amazonaws.com/clickhouse-test-reports/39194/45dfaa3491ff43774463449a3060df631479235c/stateless_tests__thread__actions__%5B1/3%5D.html

novikd commented Jul 16, 2022

Is there anything I can do to provide more information about this situation? Would it be useful if I built clang with TSAN_DEBUG_OUTPUT and tried to catch it again?

dvyukov commented Jul 18, 2022

The next stack traces were collected at 2022-07-14 02:34:48, and there is no sign of __sanitizer::internal_sched_yield in them.

I see.
Unfortunately, while I can reproduce the contention with __sanitizer::internal_sched_yield, I can't reproduce what I think may be happening with ReportRace in synthetic tests.

Does it happen in every run on this test now? Or just in that one run? How reproducible is it?

Would it be useful if I built clang with TSAN_DEBUG_OUTPUT and tried to catch it again?

TSAN_DEBUG_OUTPUT mode logs one line per memory access in the program, so if it runs for minutes, that will be zettabytes of logs...

Any chance I can reproduce it locally? That would be the fastest way to debug.
Alternatively, I can start giving you debug patches for you to test, but I'm afraid of the round-trip time of that kind of debugging.

dvyukov commented Jul 18, 2022

I've sent https://reviews.llvm.org/D130002 for the __sanitizer::internal_sched_yield issue. Even if it's not the main one, it's still an issue, and it's easily reproducible on a synthetic benchmark, so there's no reason not to fix it.

novikd commented Jul 19, 2022

Does it happen in every run on this test now? Or just in that one run? How reproducible is it?

It happens from time to time; we see it once every couple of days. Here are the statistics for such test runs in our CI since the migration:
https://play.clickhouse.com/play?user=play#c2VsZWN0CiAgICBjb3VudCgpCmZyb20gY2hlY2tzCndoZXJlICcyMDIyLTA0LTAxJyA8PSBjaGVja19zdGFydF90aW1lIGFuZCB0ZXN0X25hbWU9JycgYW5kIGNoZWNrX25hbWUgaWxpa2UgJyVzdGF0ZWxlc3MldGhyZWFkJScgIGFuZCAoY2hlY2tfZHVyYXRpb25fbXMgLyAxMDAwIC8gNjApID4gMTAw

Master was broken for several days, so I'd say it has happened ~40-50 times since April.

Any chance I can reproduce it locally? That would be the fastest way to debug.
Alternatively, I can start giving you debug patches for you to test, but I'm afraid of the round-trip time of that kind of debugging.

It seems to me that it's hard to reproduce locally for now. We are going to try to reproduce it on our side (we suspect it may be related to aggregation, i.e. many memory accesses to nearby memory addresses). I also want to try attaching perf to the server process in the time between stack-trace collections; maybe that will help us understand what happens during that period.

dvyukov commented Jul 19, 2022

Master was broken for several days, so I'd say it has happened ~40-50 times since April.

After removing the check_duration_ms constraint I got 16411, so it seems that it's failing 1 time out of ~16411/66=250.
You mentioned that normally the test runs for several minutes, which means the issue happens roughly once per ~10 hours of runtime.

novikd commented Jul 19, 2022

After removing the check_duration_ms constraint I got 16411, so it seems that it's failing 1 time out of ~16411/66=250.
You mentioned that normally the test runs for several minutes, which means the issue happens roughly once per ~10 hours of runtime.

Sorry for the poor description. The server slows down at random moments, so it affects random tests; what I counted is the number of times this problem happened across all test runs. That means it happens roughly once per 250-300 commits in our repository, on average. It takes about 45 minutes on average to run all tests with TSAN, so 45 * 250 / 60 = 187.5 hours, or 7-8 days.

dvyukov commented Jul 19, 2022

7-8 days explains why I couldn't trigger it :)

I've tried to create absolutely worst-case conditions for what I think is happening:
https://github.com/dvyukov/llvm-project/pull/6/files

and it indeed triggers spurious ReportRace calls, but the most I can get is tens of milliseconds of delay in ReportRace:

$ TSAN_OPTIONS=history_size=100 ./bench_spurious_race 144 1000000
#074: ReportRace 0x7b600040c3f0 took 364 us
#098: ReportRace 0x7b60004cabf8 took 22 us
#053: ReportRace 0x7b6000674b98 took 79281 us
#053: ReportRace 0x7b6000674be8 took 155 us
#200: ReportRace 0x7b60005a5e88 took 35452 us
#104: ReportRace 0x7b60007193c0 took 14 us
#104: ReportRace 0x7b60007193e8 took 8 us
#143: ReportRace 0x7b60009c0fb8 took 192 us
#041: ReportRace 0x7b6000589ff8 took 222 us
#191: ReportRace 0x7b6000679ed0 took 27 us
#137: ReportRace 0x7b6000400b10 took 12 us
#101: ReportRace 0x7b60008a56e0 took 139 us
#104: ReportRace 0x7b6000a1d778 took 120 us
#152: ReportRace 0x7b6000ad29d8 took 146 us
#005: ReportRace 0x7b6000ac4ec8 took 80492 us
#284: ReportRace 0x7b60007d5fe0 took 35 us
#017: ReportRace 0x7b6000570358 took 31 us
#317: ReportRace 0x7b6000751520 took 58 us
#317: ReportRace 0x7b6000751578 took 38 us
#215: ReportRace 0x7b60008d7208 took 65 us
#170: ReportRace 0x7b6000a4e760 took 63 us
#368: ReportRace 0x7b6000440bb0 took 47 us
#113: ReportRace 0x7b60006957c0 took 36 us
#152: ReportRace 0x7b6000adeba8 took 45 us
#152: ReportRace 0x7b6000adebe8 took 34 us
#125: ReportRace 0x7b6000b447e0 took 13 us
#230: ReportRace 0x7b60002e87c8 took 21790 us
#233: ReportRace 0x7b6000876bc8 took 63 us
#374: ReportRace 0x7b6000e00dd8 took 63 us
#374: ReportRace 0x7b6000e00fd8 took 43 us
#374: ReportRace 0x7b6000e00ff0 took 45 us
#251: ReportRace 0x7b6000b9cb60 took 341 us
#254: ReportRace 0x7b600094afe0 took 149 us
#365: ReportRace 0x7b60006a47c8 took 710 us
#365: ReportRace 0x7b60006a47d8 took 133 us
#062: ReportRace 0x7b6000edb3e8 took 468 us
#359: ReportRace 0x7b60006323c8 took 14 us
#038: ReportRace 0x7b6000dac7c8 took 132 us
#131: ReportRace 0x7b6000e6c7f0 took 78 us

dvyukov commented Jul 19, 2022

More stack traces from when such a stall happens may shed some light. Otherwise I am out of ideas.

novikd commented Jul 19, 2022

I happened to look at stack traces collected from a run where no tests failed with a timeout, and I find them suspicious. The first stack-trace snapshot contains 22 calls to ReportRace (TSAN reported nothing, so it was just the long path):
https://s3.amazonaws.com/clickhouse-test-reports/39194/3e10c58cf84b2831501a4f8d6eb20e907d8a3cee/stateless_tests__thread__actions__[2/3]/tsan_traces.txt

I looked in the logs to see what those threads were doing during that period, and it's very strange:

2022.07.19 05:48:55.568086 [ 8054 ] {4e5c2b06-dc83-4e94-a7c3-792e3a773ecf} <Debug> Aggregator: Merged partially aggregated blocks. 65505 rows, 511.76 KiB. in 1.467138973 sec. (44648.122 rows/sec., 348.81 KiB/sec.)
2022.07.19 05:51:17.132157 [ 8054 ] {4e5c2b06-dc83-4e94-a7c3-792e3a773ecf} <Debug> Aggregator: Merged partially aggregated blocks. 65598 rows, 512.48 KiB. in 140.848265449 sec. (465.735 rows/sec., 3.64 KiB/sec.)

The second merge in the same thread is almost 100x slower; that shouldn't happen.

So I decided to look at the performance of the test that contains this query. It turns out that it became slower after the migration to clang-14.

Statistics for runs with TSAN clang-14:
https://play.clickhouse.com/play?user=play#U0VMRUNUIAogdG9TdGFydE9mRGF5KGNoZWNrX3N0YXJ0X3RpbWUpIEFTIGRheSwKIHJvdW5kKG1heCh0ZXN0X2R1cmF0aW9uX21zKSAvIDEwMDAgLyA2MCwgMSkgQVMgbWF4X2R1cl9taW51dGVzLAogcm91bmQoYXZnKHRlc3RfZHVyYXRpb25fbXMpIC8gMTAwMCAvIDYwLCAxKSBBUyBhdmdfZHVyX21pbnV0ZXMsCiByb3VuZChxdWFudGlsZSgwLjEpKHRlc3RfZHVyYXRpb25fbXMpIC8gMTAwMCAvIDYwLCAxKSBBUyBxMV9kdXJfbWludXRlcywKIHJvdW5kKHF1YW50aWxlKDAuNSkodGVzdF9kdXJhdGlvbl9tcykgLyAxMDAwIC8gNjAsIDEpIEFTIHE1X2R1cl9taW51dGVzLAogcm91bmQocXVhbnRpbGUoMC45KSh0ZXN0X2R1cmF0aW9uX21zKSAvIDEwMDAgLyA2MCwgMSkgQVMgcTlfZHVyX21pbnV0ZXMKRlJPTSBjaGVja3MKV0hFUkUgJzIwMjItMDYtMDQnIDw9IGNoZWNrX3N0YXJ0X3RpbWUgYW5kIHRlc3RfbmFtZT0nMDE3MzBfZGlzdHJpYnV0ZWRfZ3JvdXBfYnlfbm9fbWVyZ2Vfb3JkZXJfYnlfbG9uZycgYW5kIGNoZWNrX25hbWUgaWxpa2UgJyVzdGF0ZWxlc3MldGhyZWFkJScgYW5kIGNoZWNrX2R1cmF0aW9uX21zID4gMzAwMDAwIGFuZCBwdWxsX3JlcXVlc3RfbnVtYmVyPTAKR1JPVVAgQlkgZGF5Ck9SREVSIEJZIGRheSBkZXNj
Statistics for runs with TSAN clang-13:
https://play.clickhouse.com/play?user=play#U0VMRUNUIAogdG9TdGFydE9mRGF5KGNoZWNrX3N0YXJ0X3RpbWUpIEFTIGRheSwKIHJvdW5kKG1heCh0ZXN0X2R1cmF0aW9uX21zKSAvIDEwMDAgLyA2MCwgMSkgQVMgbWF4X2R1cl9taW51dGVzLAogcm91bmQoYXZnKHRlc3RfZHVyYXRpb25fbXMpIC8gMTAwMCAvIDYwLCAxKSBBUyBhdmdfZHVyX21pbnV0ZXMsCiByb3VuZChxdWFudGlsZSgwLjEpKHRlc3RfZHVyYXRpb25fbXMpIC8gMTAwMCAvIDYwLCAxKSBBUyBxMV9kdXJfbWludXRlcywKIHJvdW5kKHF1YW50aWxlKDAuNSkodGVzdF9kdXJhdGlvbl9tcykgLyAxMDAwIC8gNjAsIDEpIEFTIHE1X2R1cl9taW51dGVzLAogcm91bmQocXVhbnRpbGUoMC45KSh0ZXN0X2R1cmF0aW9uX21zKSAvIDEwMDAgLyA2MCwgMSkgQVMgcTlfZHVyX21pbnV0ZXMKRlJPTSBjaGVja3MKV0hFUkUgJzIwMjItMDYtMDQnID4gY2hlY2tfc3RhcnRfdGltZSBhbmQgdGVzdF9uYW1lPScwMTczMF9kaXN0cmlidXRlZF9ncm91cF9ieV9ub19tZXJnZV9vcmRlcl9ieV9sb25nJyBhbmQgY2hlY2tfbmFtZSBpbGlrZSAnJXN0YXRlbGVzcyV0aHJlYWQlJyBhbmQgY2hlY2tfZHVyYXRpb25fbXMgPiAzMDAwMDAgYW5kIHB1bGxfcmVxdWVzdF9udW1iZXI9MApHUk9VUCBCWSBkYXkKT1JERVIgQlkgZGF5IGRlc2M=

Statistics for runs with MSAN clang-14:
https://play.clickhouse.com/play?user=play#U0VMRUNUIAogdG9TdGFydE9mRGF5KGNoZWNrX3N0YXJ0X3RpbWUpIEFTIGRheSwKIHJvdW5kKG1heCh0ZXN0X2R1cmF0aW9uX21zKSAvIDEwMDAgLyA2MCwgMSkgQVMgbWF4X2R1cl9taW51dGVzLAogcm91bmQoYXZnKHRlc3RfZHVyYXRpb25fbXMpIC8gMTAwMCAvIDYwLCAxKSBBUyBhdmdfZHVyX21pbnV0ZXMsCiByb3VuZChxdWFudGlsZSgwLjEpKHRlc3RfZHVyYXRpb25fbXMpIC8gMTAwMCAvIDYwLCAxKSBBUyBxMV9kdXJfbWludXRlcywKIHJvdW5kKHF1YW50aWxlKDAuNSkodGVzdF9kdXJhdGlvbl9tcykgLyAxMDAwIC8gNjAsIDEpIEFTIHE1X2R1cl9taW51dGVzLAogcm91bmQocXVhbnRpbGUoMC45KSh0ZXN0X2R1cmF0aW9uX21zKSAvIDEwMDAgLyA2MCwgMSkgQVMgcTlfZHVyX21pbnV0ZXMKRlJPTSBjaGVja3MKV0hFUkUgJzIwMjItMDYtMDQnIDw9IGNoZWNrX3N0YXJ0X3RpbWUgYW5kIHRlc3RfbmFtZT0nMDE3MzBfZGlzdHJpYnV0ZWRfZ3JvdXBfYnlfbm9fbWVyZ2Vfb3JkZXJfYnlfbG9uZycgYW5kIGNoZWNrX25hbWUgaWxpa2UgJyVzdGF0ZWxlc3MlbWVtb3J5JScgYW5kIGNoZWNrX2R1cmF0aW9uX21zID4gMzAwMDAwIGFuZCBwdWxsX3JlcXVlc3RfbnVtYmVyPTAKR1JPVVAgQlkgZGF5Ck9SREVSIEJZIGRheSBkZXNj
Statistics for runs with MSAN clang-13:
https://play.clickhouse.com/play?user=play#U0VMRUNUIAogdG9TdGFydE9mRGF5KGNoZWNrX3N0YXJ0X3RpbWUpIEFTIGRheSwKIHJvdW5kKG1heCh0ZXN0X2R1cmF0aW9uX21zKSAvIDEwMDAgLyA2MCwgMSkgQVMgbWF4X2R1cl9taW51dGVzLAogcm91bmQoYXZnKHRlc3RfZHVyYXRpb25fbXMpIC8gMTAwMCAvIDYwLCAxKSBBUyBhdmdfZHVyX21pbnV0ZXMsCiByb3VuZChxdWFudGlsZSgwLjEpKHRlc3RfZHVyYXRpb25fbXMpIC8gMTAwMCAvIDYwLCAxKSBBUyBxMV9kdXJfbWludXRlcywKIHJvdW5kKHF1YW50aWxlKDAuNSkodGVzdF9kdXJhdGlvbl9tcykgLyAxMDAwIC8gNjAsIDEpIEFTIHE1X2R1cl9taW51dGVzLAogcm91bmQocXVhbnRpbGUoMC45KSh0ZXN0X2R1cmF0aW9uX21zKSAvIDEwMDAgLyA2MCwgMSkgQVMgcTlfZHVyX21pbnV0ZXMKRlJPTSBjaGVja3MKV0hFUkUgJzIwMjItMDYtMDQnID4gY2hlY2tfc3RhcnRfdGltZSBhbmQgdGVzdF9uYW1lPScwMTczMF9kaXN0cmlidXRlZF9ncm91cF9ieV9ub19tZXJnZV9vcmRlcl9ieV9sb25nJyBhbmQgY2hlY2tfbmFtZSBpbGlrZSAnJXN0YXRlbGVzcyVtZW1vcnklJyBhbmQgY2hlY2tfZHVyYXRpb25fbXMgPiAzMDAwMDAgYW5kIHB1bGxfcmVxdWVzdF9udW1iZXI9MApHUk9VUCBCWSBkYXkKT1JERVIgQlkgZGF5IGRlc2M=

As you can see, performance with MSAN didn't change, but with TSAN it became slower, so it's related to TSAN, not ClickHouse. This test involves aggregation, so that may be the trigger. I'll continue the investigation.

dvyukov added a commit to dvyukov/llvm-project that referenced this issue Jul 21, 2022

dvyukov commented Jul 21, 2022

I think I managed to reproduce something similar. It involves a producer thread and lots of consumer threads that read the same data. I've left my debugging there for future reference:
https://github.com/dvyukov/llvm-project/pull/6/files
The proposed fix is here:
https://reviews.llvm.org/D130240

dvyukov added a commit to llvm/llvm-project that referenced this issue Jul 25, 2022
Prevent the following pathological behavior:
Since memory access handling is not synchronized with DoReset,
a thread running concurrently with DoReset can leave a bogus shadow value
that will later be falsely detected as a race. For such false races
RestoreStack will return false and we will not report them.
However, consider that a thread leaves a whole lot of such bogus values
and these values are later read by a whole lot of threads.
This will cause massive numbers of ReportRace calls and lots of
serialization. In very pathological cases the resulting slowdown
can be >100x. This is very unlikely, but it was presumably observed
in practice: google/sanitizers#1552
If this happens, the previous access sid+epoch will be the same for all of
these false races, because if the thread tries to increment its epoch, it will
notice that DoReset has happened and will stop producing bogus shadow
values. So last_spurious_race is used to remember the last sid+epoch
for which RestoreStack returned false; it is then used to filter out
races with the same sid+epoch very early and cheaply.
It is of course possible that multiple threads leave multiple bogus shadow
values and all of them are read by lots of threads at the same time.
In such a case last_spurious_race will only be able to deduplicate a few
races from one thread, then a few from another, and so on. An alternative
would be to hold an array of such sid+epoch values, but we consider that
scenario even less likely.
Note: this can lead to some rare false negatives as well:
1. When a legit access with the same sid+epoch participates in a race
as the "previous" memory access, it will be wrongly filtered out.
2. When RestoreStack returns false for a legit memory access because it
was already evicted from the thread trace, we will still remember it in
last_spurious_race. Then if there is another racing memory access from
the same thread that happened in the same epoch, but was stored in the
next thread trace part (which is still preserved in the thread trace),
we will also wrongly filter it out while RestoreStack would actually
succeed for that second memory access.

Reviewed By: melver

Differential Revision: https://reviews.llvm.org/D130269
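
For illustration, the filtering described in this commit message boils down to roughly the following (a simplified sketch of the idea, not the actual compiler-rt code; names, types and synchronization are approximations):

// Simplified sketch of the last_spurious_race deduplication described above.
#include <cstdint>

struct AccessId {                      // approximate stand-in for sid+epoch
    uint8_t  sid   = 0;
    uint16_t epoch = 0;
    bool operator==(const AccessId& o) const { return sid == o.sid && epoch == o.epoch; }
};

static AccessId last_spurious_race;    // last previous-access id for which RestoreStack failed

void report_race_sketch(const AccessId& prev) {
    // Cheap early filter: bogus shadow values left around a reset all carry
    // the same sid+epoch, so repeated spurious races are rejected without
    // redoing the expensive RestoreStack work.
    if (prev == last_spurious_race)
        return;
    const bool restored = false;       // placeholder for the expensive RestoreStack(...) call
    if (!restored) {
        last_spurious_race = prev;     // remember it; the next identical spurious race exits early
        return;
    }
    // ... otherwise produce the real race report ...
}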

dvyukov commented Jul 25, 2022

The tentative fix has landed. Please re-open if it still happens with the fix.
And thanks for the detailed info about the issue.

dvyukov closed this as completed Jul 25, 2022
mem-frob pushed a commit to draperlaboratory/hope-llvm-project that referenced this issue Oct 7, 2022
arichardson pushed a commit to CTSRD-CHERI/compiler-rt that referenced this issue Sep 12, 2023
arichardson pushed a commit to CTSRD-CHERI/llvm-project that referenced this issue Jan 9, 2024