Upgrade from v4.0.9 to v6.2.0 fails; TiFlash reports "Exception: sched_setaffinity fail: Invalid argument" #5457

Closed
seiya-annie opened this issue Jul 25, 2022 · 5 comments · Fixed by #5459

Comments

seiya-annie commented Jul 25, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Install a v4.0.9 TiDB cluster with a TiFlash node.
     config:
       storage.main.dir:
         - /tiup/data/tiflash-9000
     learner_config:
       server.labels:
         host: tiup-peer
         zone: az1
  2. Upgrade this cluster to v6.2.0.

2. What did you expect to see? (Required)

The upgrade succeeds.

3. What did you see instead (Required)

[2022/07/25 10:35:43.647 +08:00] [ERROR] [BaseDaemon.cpp:377] [BaseDaemon:########################################] [thread_id=61]
[2022/07/25 10:35:43.647 +08:00] [ERROR] [BaseDaemon.cpp:378] ["BaseDaemon:(from thread 11) Received signal Aborted(6)."] [thread_id=61]
[2022/07/25 10:35:43.647 +08:00] [ERROR] [BaseDaemon.cpp:369] ["BaseDaemon:(from thread 5) Terminate called after throwing an instance of DB::Exception
Code: 0, e.displayText() = DB::Exception: sched_setaffinity fail: Invalid argument, e.what() = DB::Exception
Stack trace:
       0x65a192d\tterminate_handler() [tiflash+106567981]
                \tlibs/libdaemon/src/BaseDaemon.cpp:634
  0x7fc6d1ff5a13\tstd::__terminate(void (*)()) [libc++abi.so.1+236051]
  0x7fc6d1ff8736\t__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) [libc++abi.so.1+247606]
  0x7fc6d1ff86d0\t__cxa_throw [libc++abi.so.1+247504]
       0x72d91dc\tDB::DM::SegmentReader::run() [tiflash+120426972]
                \tdbms/src/Storages/DeltaMerge/ReadThread/SegmentReader.cpp:145
       0x72d98e2\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (DB::DM::SegmentReader::*)(), DB::DM::SegmentReader*> >(void*) [tiflash+120428770]
                \t/usr/local/bin/../include/c++/v1/thread:291
  0x7fc6cdec7ea5\tstart_thread ["] [thread_id=61]
[2022/07/25 10:35:43.647 +08:00] [ERROR] [BaseDaemon.cpp:377] [BaseDaemon:########################################] [thread_id=61]
[2022/07/25 10:35:43.647 +08:00] [ERROR] [BaseDaemon.cpp:378] ["BaseDaemon:(from thread 5) Received signal Aborted(6)."] [thread_id=61]

4. What is your TiFlash version? (Required)

[root@tiup-0 tiflash]# ./tiflash --version
<jemalloc>: Number of CPUs detected is not deterministic. Per-CPU arena disabled.
TiFlash
Release Version: v6.2.0
Edition:         Community
Git Commit Hash: f11c6c49354f7b7188fce6d5c8640bf5e3f762fb
Git Branch:      heads/refs/tags/v6.2.0
UTC Build Time:  2022-07-22 03:06:59
Enable Features: jemalloc avx avx512 unwind thinlto
Profile:         RELWITHDEBINFO

Raft Proxy
Git Commit Hash:   cdd5996980ecbe5e8d9fe597ec620a5fe394d586
Git Commit Branch: HEAD
UTC Build Time:    2022-07-22 03:10:20
Rust Version:      rustc 1.60.0-nightly (1e12aef3f 2022-02-13)
Storage Engine:    tiflash
Prometheus Prefix: tiflash_proxy_
Profile:           release
[root@tiup-0 tiflash]# 
seiya-annie added the type/bug label (the issue is confirmed as a bug) on Jul 25, 2022
@seiya-annie
Author

tiflash log:
tiflash.log.gz

@JaySon-Huang
Contributor

/cc @JinheLin

@JinheLin
Contributor

...
[2022/07/25 10:35:39.328 +08:00] [INFO] [SegmentReader.cpp:207] ["SegmentReaderPoolManager:numa_nodes 2 => [[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38], [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39]]"] [thread_id=1]
[2022/07/25 10:35:39.328 +08:00] [INFO] [SegmentReader.cpp:171] ["SegmentReaderPool:Create SegmentReaderPool thread_count 20 cpus [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38] start"] [thread_id=1]
[2022/07/25 10:35:39.333 +08:00] [INFO] [SegmentReader.cpp:176] ["SegmentReaderPool:Create SegmentReaderPool thread_count 20 cpus [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38] end"] [thread_id=1]
[2022/07/25 10:35:39.333 +08:00] [INFO] [SegmentReader.cpp:171] ["SegmentReaderPool:Create SegmentReaderPool thread_count 20 cpus [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39] start"] [thread_id=1]
[2022/07/25 10:35:39.336 +08:00] [ERROR] [SegmentReader.cpp:72] ["SegmentReader:sched_setaffinity fail: Invalid argument"] [thread_id=2]
[2022/07/25 10:35:39.338 +08:00] [ERROR] [SegmentReader.cpp:72] ["SegmentReader:sched_setaffinity fail: Invalid argument"] [thread_id=3]
[2022/07/25 10:35:39.340 +08:00] [ERROR] [SegmentReader.cpp:72] ["SegmentReader:sched_setaffinity fail: Invalid argument"] [thread_id=4]
[2022/07/25 10:35:39.342 +08:00] [ERROR] [SegmentReader.cpp:72] ["SegmentReader:sched_setaffinity fail: Invalid argument"] [thread_id=5]
[2022/07/25 10:35:39.342 +08:00] [ERROR] [SegmentReader.cpp:72] ["SegmentReader:sched_setaffinity fail: Invalid argument"] [thread_id=6]
[2022/07/25 10:35:39.342 +08:00] [DEBUG] [SegmentReader.cpp:75] ["SegmentReader:sched_setaffinity cpus [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39] succ"] [thread_id=7]
[2022/07/25 10:35:39.342 +08:00] [DEBUG] [SegmentReader.cpp:75] ["SegmentReader:sched_setaffinity cpus [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39] succ"] [thread_id=8]
[2022/07/25 10:35:39.342 +08:00] [ERROR] [SegmentReader.cpp:72] ["SegmentReader:sched_setaffinity fail: Invalid argument"] [thread_id=9]
[2022/07/25 10:35:39.342 +08:00] [ERROR] [SegmentReader.cpp:72] ["SegmentReader:sched_setaffinity fail: Invalid argument"] [thread_id=10]
[2022/07/25 10:35:39.342 +08:00] [ERROR] [SegmentReader.cpp:72] ["SegmentReader:sched_setaffinity fail: Invalid argument"] [thread_id=11]
[2022/07/25 10:35:39.343 +08:00] [ERROR] [SegmentReader.cpp:72] ["SegmentReader:sched_setaffinity fail: Invalid argument"] [thread_id=12]
[2022/07/25 10:35:39.343 +08:00] [ERROR] [SegmentReader.cpp:72] ["SegmentReader:sched_setaffinity fail: Invalid argument"] [thread_id=13]
[2022/07/25 10:35:39.343 +08:00] [DEBUG] [SegmentReader.cpp:75] ["SegmentReader:sched_setaffinity cpus [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39] succ"] [thread_id=14]
...

It seems that the CPUs of the other NUMA node are not working properly. I will add some code to handle this exception.

Can you check the CPU information of the machine? @seiya-annie
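One way to check this programmatically is to compare the CPU list from the log with the CPUs the kernel reports as online. The sketch below is only an illustration, not TiFlash code: the helper names `parseCpuList`, `missingCpus`, and `readOnlineCpus` are hypothetical, and it assumes a Linux machine that exposes `/sys/devices/system/cpu/online` in the usual cpulist format (for example `0-3,8,10-11`). Malformed input makes `std::stoi` throw, which is acceptable for a diagnostic helper.

```cpp
#include <algorithm>
#include <cctype>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Parse a Linux cpulist string such as "0-3,8,10-11" into individual CPU ids.
std::vector<int> parseCpuList(const std::string & text)
{
    std::vector<int> cpus;
    std::stringstream ss(text);
    std::string item;
    while (std::getline(ss, item, ','))
    {
        // Drop spaces and the trailing newline that sysfs files carry.
        item.erase(std::remove_if(item.begin(), item.end(),
                                  [](unsigned char c) { return std::isspace(c) != 0; }),
                   item.end());
        if (item.empty())
            continue;
        const auto dash = item.find('-');
        const int first = std::stoi(item.substr(0, dash));
        const int last = (dash == std::string::npos) ? first : std::stoi(item.substr(dash + 1));
        for (int cpu = first; cpu <= last; ++cpu)
            cpus.push_back(cpu);
    }
    return cpus;
}

// Return the CPUs in `wanted` that do not appear in `online`.
std::vector<int> missingCpus(const std::vector<int> & wanted, const std::vector<int> & online)
{
    std::vector<int> missing;
    for (int cpu : wanted)
        if (std::find(online.begin(), online.end(), cpu) == online.end())
            missing.push_back(cpu);
    return missing;
}

// Read the kernel's list of online CPUs; returns an empty vector if the file is unreadable.
std::vector<int> readOnlineCpus(const std::string & path = "/sys/devices/system/cpu/online")
{
    std::ifstream in(path);
    std::string line;
    if (!in || !std::getline(in, line))
        return {};
    return parseCpuList(line);
}
```

Calling `missingCpus(parseCpuList("0,2,4,6"), readOnlineCpus())` on the affected host would list any of those CPUs that are currently offline. An empty result does not rule out other restrictions, such as a cpuset that excludes them.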

@Lloyd-Pottiger
Contributor

Lloyd-Pottiger commented Jul 25, 2022

The CPUs [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38] belong to the other NUMA node, so is sched_setaffinity expected to fail? @JinheLin

@JinheLin
Contributor

JinheLin commented Jul 25, 2022

There are two NUMA nodes on this machine, so two SegmentReaderPools are created (one SegmentReaderPool per NUMA node).

sched_setaffinity succeeds for the NUMA node with CPUs [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39], but fails for the other node.

An offline CPU can cause sched_setaffinity to fail. There may be other scenarios that also make a CPU inaccessible.
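A minimal sketch of tolerating this failure instead of throwing is shown below. It is not the actual fix merged for this issue. The helpers `trySetAffinity` and `currentAffinity` are hypothetical names, and the code assumes Linux with glibc, where `sched_setaffinity` returns `EINVAL` when the mask contains no CPU that is both present and permitted to the thread (for example because the CPUs are offline or excluded by a cpuset).

```cpp
#include <sched.h> // sched_setaffinity, sched_getaffinity, CPU_* macros (GNU extensions)

#include <cerrno>
#include <cstdio>
#include <cstring>
#include <vector>

// Try to pin the calling thread to `cpus`.
// Returns true on success. On failure it logs the error and returns false,
// leaving the thread's affinity unchanged instead of throwing.
bool trySetAffinity(const std::vector<int> & cpus)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus)
        if (cpu >= 0 && cpu < CPU_SETSIZE)
            CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0)
    {
        std::fprintf(stderr, "sched_setaffinity fail: %s; keeping current affinity\n",
                     std::strerror(errno));
        return false;
    }
    return true;
}

// Return the CPUs the calling thread is currently allowed to run on
// (empty if the query itself fails).
std::vector<int> currentAffinity()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            cpus.push_back(cpu);
    return cpus;
}
```

With this shape, a reader thread whose NUMA node cannot be used keeps running unpinned, which trades some memory locality for availability rather than aborting the process.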
