Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cockroachdb crashed in Go runtime after reporting EINVAL from port_getn #1130

Closed
davepacheco opened this issue May 27, 2022 · 20 comments
Closed
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@davepacheco
Copy link
Collaborator

While testing #1129:

test integration_tests::disks::test_disk_not_found_before_creation has been running for over 60 seconds

I don't see how this could be related to #1129. More data coming.

@davepacheco
Copy link
Collaborator Author

Where's the test program?

$ ptree
...
  24783  /usr/sbin/sshd -R
    24785  /usr/sbin/sshd -R
      24786  -bash
        7622   /home/dap/.rustup/toolchains/nightly-2022-04-27-x86_64-unknown-illumo
          10286  /home/dap/omicron/target/debug/deps/test_all-d586ea57740e3382
            10622  <defunct>
            10653  clickhouse server --log-file /dangerzone/omicron_tmp/.tmpPfAddb/c

Stacks:

$ pstack 10286
10286:  /home/dap/omicron/target/debug/deps/test_all-d586ea57740e3382
--------------------- thread# 1 / lwp# 1 ---------------------
 fffffc7feec011a7 lwp_park (0, 0, 0)
 fffffc7feebfa65b cond_wait_queue (cf3ffb0, cf3efa0, 0) + 5b
 fffffc7feebfac9a __cond_wait (cf3ffb0, cf3efa0) + 8a
 fffffc7feebfad04 cond_wait (cf3ffb0, cf3efa0) + 24
 fffffc7feebfad3b pthread_cond_wait (cf3ffb0, cf3efa0) + b
 000000000b761bc1 _ZN3std6thread4park17hae0c2631805e5004E () + b1
 000000000b77ad55 _ZN3std4sync4mpsc8blocking9WaitToken4wait17he7ab6597568f2f43E () + 25
 0000000006e98c31 _ZN3std4sync4mpsc6shared15Packet$LT$T$GT$4recv17hc177a08fa8ea75c1E () + 221
 0000000006e97d91 _ZN3std4sync4mpsc17Receiver$LT$T$GT$4recv17h0eb76c22b0535639E () + a1
 0000000006ec5c62 _ZN4test9run_tests17hac0c8685aacc3e6eE () + 26e2
 0000000006eacb90 _ZN4test7console17run_tests_console17h1f2eb8456f97a823E () + 520
 0000000006ec22d9 _ZN4test9test_main17hb6ffc10996341de4E () + 1b9
 0000000006ec3177 _ZN4test16test_main_static17haaea39169cfe956bE () + 67
 00000000066a18c5 _ZN8test_all4main17h364d367ff1516a7dE () + 15
 00000000064dbb2e _ZN4core3ops8function6FnOnce9call_once17hdf71091bbf4ea904E () + e
 0000000006626591 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h92de90bd323235bdE () + 11
 0000000006d0a1d4 _ZN3std2rt10lang_start28_$u7b$$u7b$closure$u7d$$u7d$17h583329574accbd6eE () + 14
 000000000b760ed7 _ZN3std2rt19lang_start_internal17h0fb6664297a09837E () + 1d7
 0000000006d0a1ae _ZN3std2rt10lang_start17h3d000a75ee6e58eaE () + 2e
 00000000066a18ef main () + 1f
 0000000006474b77 _start_crt () + 87
 0000000006474ad8 _start () + 18
------------------- thread# 121 / lwp# 121 -------------------
 fffffc7feec07aba ioctl    (4c, d001, fffffc7fdf1f1330)
 000000000b65c205 _ZN3mio3sys4unix8selector5epoll8Selector6select17h35ffae35187b07ccE () + 95
 000000000b666ca8 _ZN3mio4poll4Poll4poll17hd8261eaa63c7d928E () + 58
 000000000b531611 _ZN5tokio2io6driver6Driver4turn17h62db01a31c363cedE () + 111
 000000000b531f2b _ZN63_$LT$tokio..io..driver..Driver$u20$as$u20$tokio..park..Park$GT$12park_timeout17h3c45c69a99a7e94dE () + 2b
 000000000b5be2dc _ZN73_$LT$tokio..signal..unix..driver..Driver$u20$as$u20$tokio..park..Park$GT$12park_timeout17hd20f4bc7d0041819E () + 1c
 000000000b4e783c _ZN73_$LT$tokio..process..imp..driver..Driver$u20$as$u20$tokio..park..Park$GT$12park_timeout17h69468f274db5e473E () + 1c
 000000000b56eaa4 _ZN78_$LT$tokio..park..either..Either$LT$A$C$B$GT$$u20$as$u20$tokio..park..Park$GT$12park_timeout17hff8ed325c5590309E () + 44
 000000000b55fd10 _ZN5tokio4time6driver15Driver$LT$P$GT$12park_timeout17h231cc2ef542e4606E () + 60
 000000000b55efc9 _ZN5tokio4time6driver15Driver$LT$P$GT$13park_internal17ha4c0d1c94eb709fdE () + 309
 000000000b55f51d _ZN74_$LT$tokio..time..driver..Driver$LT$P$GT$$u20$as$u20$tokio..park..Park$GT$4park17had4fde7655809497E () + 1d
 000000000b56e8cf _ZN78_$LT$tokio..park..either..Either$LT$A$C$B$GT$$u20$as$u20$tokio..park..Park$GT$4park17h045e2c936ab48607E () + 2f
 000000000b5409e1 _ZN68_$LT$tokio..runtime..driver..Driver$u20$as$u20$tokio..park..Park$GT$4park17hd68fd1e690dc4624E () + 11
 000000000b53f021 _ZN5tokio7runtime15basic_scheduler7Context4park28_$u7b$$u7b$closure$u7d$$u7d$17h47e8a92b18bf7a0aE () + 11
 000000000b53f42e _ZN5tokio7runtime15basic_scheduler7Context5enter17h7294c924da3792e6E () + 13e
 000000000b53ed78 _ZN5tokio7runtime15basic_scheduler7Context4park17hba4c872fb9ff81d2E () + 218
 0000000006687bd6 _ZN5tokio7runtime15basic_scheduler9CoreGuard8block_on28_$u7b$$u7b$closure$u7d$$u7d$17he20670f4f92023f5E () + 446
 0000000006670ceb _ZN5tokio7runtime15basic_scheduler9CoreGuard5enter28_$u7b$$u7b$closure$u7d$$u7d$17hf40e1d64fca5f192E () + 2b
 0000000006bfb0a1 _ZN5tokio6macros10scoped_tls18ScopedKey$LT$T$GT$3set17h64bb49fc6b8c6e6cE () + 91
 00000000066664fa _ZN5tokio7runtime15basic_scheduler9CoreGuard5enter17h3e025b3fdfe822d1E () + 15a
 00000000066719ac _ZN5tokio7runtime15basic_scheduler9CoreGuard8block_on17h953ec13feac6e171E () + 3c
 000000000662e8a5 _ZN5tokio7runtime15basic_scheduler14BasicScheduler8block_on17h4fa41925d5ca493cE () + 135
 0000000006b2e223 _ZN5tokio7runtime7Runtime8block_on17he95a59f4174b907eE () + e3
 0000000006a026db _ZN8test_all17integration_tests5disks35test_disk_not_found_before_creation17ha6ee58cbe967e95bE () + 12b
 0000000006c92db1 _ZN8test_all17integration_tests5disks35test_disk_not_found_before_creation28_$u7b$$u7b$closure$u7d$$u7d$17ha5f1cffb2f2f7d4dE () + 11
 00000000064d9e61 _ZN4core3ops8function6FnOnce9call_once17hbd8900751ef4e765E () + 11
 0000000006ecd056 _ZN4test28__rust_begin_short_backtrace17h080f41974328d4ebE () + 6
 0000000006ecbf1a _ZN4test8run_test14run_test_inner28_$u7b$$u7b$closure$u7d$$u7d$17heb85ac5f8b6f021dE () + 31a
 0000000006e94fed _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17ha5c71ba99a177e78E () + cd
 0000000006e9bb2d _ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17he05ce777e4a09f65E () + 6d
 000000000b788907 _ZN3std3sys4unix6thread6Thread3new12thread_start17h4253fc05178a52beE () + 27
 fffffc7feec00e4c _thrp_setup (fffffc7fee27b240) + 6c
 fffffc7feec01160 _lwp_start ()
-------------------- thread# 60 / lwp# 60 --------------------
 fffffc7feec011a7 lwp_park (0, fffffc7fedf27e60, 0)
 fffffc7feebfa65b cond_wait_queue (fffffc7feefc6820, fffffc7feefc6840, fffffc7fedf27e60) + 5b
 fffffc7feebfaaad cond_wait_common (fffffc7feefc6820, fffffc7feefc6840, fffffc7fedf27e60) + 1dd
 fffffc7feebfadb5 __cond_timedwait (fffffc7feefc6820, fffffc7feefc6840, fffffc7fedf27f50) + 65
 fffffc7feebfae71 cond_timedwait (fffffc7feefc6820, fffffc7feefc6840, fffffc7fedf27f50) + 31
 fffffc7feef79dd6 umem_update_thread (0) + 1c6
 fffffc7feec00e4c _thrp_setup (fffffc7fee277a40) + 6c
 fffffc7feec01160 _lwp_start ()
------------------- thread# 122 / lwp# 122 -------------------
 fffffc7feec011a7 lwp_park (0, 0, 0)
 fffffc7feebfa65b cond_wait_queue (14045ed0, 1404dd90, 0) + 5b
 fffffc7feebfac9a __cond_wait (14045ed0, 1404dd90) + 8a
 fffffc7feebfad04 cond_wait (14045ed0, 1404dd90) + 24
 fffffc7feebfad3b pthread_cond_wait (14045ed0, 1404dd90) + b
 000000000b761bc1 _ZN3std6thread4park17hae0c2631805e5004E () + b1
 000000000ab4d74c _ZN17crossbeam_channel7context7Context10wait_until17h3bbcdb824fd45064E () + 1ac
 000000000ab58b67 _ZN17crossbeam_channel7flavors5array16Channel$LT$T$GT$4recv28_$u7b$$u7b$closure$u7d$$u7d$17h2fe513ed13821f72E () + e7
 000000000ab4de01 _ZN17crossbeam_channel7context7Context4with28_$u7b$$u7b$closure$u7d$$u7d$17h97b277c478638280E () + 61
 000000000ab4dd00 _ZN17crossbeam_channel7context7Context4with28_$u7b$$u7b$closure$u7d$$u7d$17h877ce4b5e6d5d49cE () + a0
 000000000ab60aff _ZN3std6thread5local17LocalKey$LT$T$GT$8try_with17h2a07f81bfddff0b5E () + af
 000000000ab4d9e7 _ZN17crossbeam_channel7context7Context4with17h922ceec444779709E () + 47
 000000000ab58a14 _ZN17crossbeam_channel7flavors5array16Channel$LT$T$GT$4recv17hea92227fb637b65fE () + 154
 000000000ab72714 _ZN17crossbeam_channel7channel17Receiver$LT$T$GT$4recv17h212c29ff1cd2ba17E () + 204
 000000000ab4b1c5 _ZN10slog_async25AsyncCoreBuilder$LT$D$GT$12spawn_thread28_$u7b$$u7b$closure$u7d$$u7d$17ha62989127caa6e85E () + 25
 000000000ab7bb52 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h1605cea4671dcc28E () + 22
 000000000ab509e2 _ZN3std6thread7Builder16spawn_unchecked_28_$u7b$$u7b$closure$u7d$$u7d$28_$u7b$$u7b$closure$u7d$$u7d$17he1f9c99ba43ff0d7E () + 22
 000000000ab84772 _ZN115_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$9call_once17h256fffcc9a91977fE () + 22
 000000000ab7be14 _ZN3std9panicking3try7do_call17hcaae20b172cee6aaE () + 44
 000000000ab83c0d __rust_try () + 1d
 000000000ab7bd4f _ZN3std9panicking3try17he4d80c0351860b61E () + 7f
 000000000ab9fc52 _ZN3std5panic12catch_unwind17h12b195f1be0ae8e0E () + 22
 000000000ab50469 _ZN3std6thread7Builder16spawn_unchecked_28_$u7b$$u7b$closure$u7d$$u7d$17h8f01426feda027e4E () + 199
 000000000aba0031 _ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17hc331ca97bee2dd24E () + 11
 000000000b788907 _ZN3std3sys4unix6thread6Thread3new12thread_start17h4253fc05178a52beE () + 27
 fffffc7feec00e4c _thrp_setup (fffffc7fee271a40) + 6c
 fffffc7feec01160 _lwp_start ()
------- thread# 378 / lwp# 378 [tokio-runtime-worker] --------
 fffffc7feec07aba ioctl    (e1, d001, fffffc7fd21fe280)
 000000000b65c205 _ZN3mio3sys4unix8selector5epoll8Selector6select17h35ffae35187b07ccE () + 95
 000000000b666ca8 _ZN3mio4poll4Poll4poll17hd8261eaa63c7d928E () + 58
 000000000b531611 _ZN5tokio2io6driver6Driver4turn17h62db01a31c363cedE () + 111
 000000000b531e9d _ZN63_$LT$tokio..io..driver..Driver$u20$as$u20$tokio..park..Park$GT$4park17h07344093db983611E () + 1d
 000000000b5be255 _ZN73_$LT$tokio..signal..unix..driver..Driver$u20$as$u20$tokio..park..Park$GT$4park17h32443ac53d15f3e1E () + 15
 000000000b4e77b5 _ZN73_$LT$tokio..process..imp..driver..Driver$u20$as$u20$tokio..park..Park$GT$4park17h80c1aa9cce54f77cE () + 15
 000000000b56e95f _ZN78_$LT$tokio..park..either..Either$LT$A$C$B$GT$$u20$as$u20$tokio..park..Park$GT$4park17h87957d8fe038fbd9E () + 2f
 000000000b55f1dd _ZN5tokio4time6driver15Driver$LT$P$GT$13park_internal17ha4c0d1c94eb709fdE () + 51d
 000000000b55f51d _ZN74_$LT$tokio..time..driver..Driver$LT$P$GT$$u20$as$u20$tokio..park..Park$GT$4park17had4fde7655809497E () + 1d
 000000000b56e8cf _ZN78_$LT$tokio..park..either..Either$LT$A$C$B$GT$$u20$as$u20$tokio..park..Park$GT$4park17h045e2c936ab48607E () + 2f
 000000000b5409e1 _ZN68_$LT$tokio..runtime..driver..Driver$u20$as$u20$tokio..park..Park$GT$4park17hd68fd1e690dc4624E () + 11
 000000000b57434b _ZN5tokio7runtime11thread_pool4park5Inner11park_driver17h9ec7762d171a07f4E () + 8b
 000000000b573ee1 _ZN5tokio7runtime11thread_pool4park5Inner4park17h7ddafe237606fcb0E () + 191
 000000000b573aae _ZN79_$LT$tokio..runtime..thread_pool..park..Parker$u20$as$u20$tokio..park..Park$GT$4park17hcc3f99c4457efe41E () + 1e
 000000000b5bc178 _ZN5tokio7runtime11thread_pool6worker7Context12park_timeout17hdcab00379890266fE () + 1f8
 000000000b5bbe29 _ZN5tokio7runtime11thread_pool6worker7Context4park17hacb8e8d0bc076225E () + 139
 000000000b5bb1d8 _ZN5tokio7runtime11thread_pool6worker7Context3run17h7ac483f6589547b4E () + 1d8
 000000000b5baf85 _ZN5tokio7runtime11thread_pool6worker3run28_$u7b$$u7b$closure$u7d$$u7d$17hce6b9a1f76319a49E () + 15
 000000000b4ddeec _ZN5tokio6macros10scoped_tls18ScopedKey$LT$T$GT$3set17h04d4937b3ffeaa05E () + 7c
 000000000b5baef6 _ZN5tokio7runtime11thread_pool6worker3run17h8a79c3751600b1f1E () + 166
 000000000b5bad81 _ZN5tokio7runtime11thread_pool6worker6Launch6launch28_$u7b$$u7b$closure$u7d$$u7d$17h9dc03fb40a186078E () + 11
 000000000b573430 _ZN102_$LT$tokio..runtime..blocking..task..BlockingTask$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$4poll17hdbe03b0a13617fa4E () + 80
 000000000b59f656 _ZN5tokio7runtime4task4core18CoreStage$LT$T$GT$4poll28_$u7b$$u7b$closure$u7d$$u7d$17h3c7dbcf7b64b9b9cE () + 66
 000000000b52ad85 _ZN5tokio4loom3std11unsafe_cell19UnsafeCell$LT$T$GT$8with_mut17ha03cf47d74dcdab0E () + 55
 000000000b59e985 _ZN5tokio7runtime4task4core18CoreStage$LT$T$GT$4poll17h250badce41553f9eE () + 25
 000000000b5e2ce8 _ZN5tokio7runtime4task7harness11poll_future28_$u7b$$u7b$closure$u7d$$u7d$17hdc0fbe7cde8e3b70E () + 28
 000000000b560685 _ZN115_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$9call_once17h5ee659ac963e2bc0E () + 15
 000000000b5f992a _ZN3std9panicking3try7do_call17ha45d954ae19fab2eE () + 3a
 000000000b5ff23d __rust_try () + 1d
 000000000b5f5129 _ZN3std9panicking3try17h81a410db2b7f8fffE () + 69
 000000000b4db25c _ZN3std5panic12catch_unwind17h4d8dd91d281de1e7E () + 1c
 000000000b5def84 _ZN5tokio7runtime4task7harness11poll_future17h264a98b51a7b8dbeE () + 74
 000000000b5cfc00 _ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$10poll_inner17h9f6391d70b0b69f2E () + 130
 000000000b5cd5c5 _ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h2c13655571dc8d0aE () + 15
 000000000b553bf2 _ZN5tokio7runtime4task3raw4poll17h9b3b8662d4e29e60E () + 22
 000000000b55377e _ZN5tokio7runtime4task3raw7RawTask4poll17hf919b1d66888dd2eE () + 2e
 000000000b539b4e _ZN5tokio7runtime4task20UnownedTask$LT$S$GT$3run17h5af1bda814c84d53E () + 2e
 000000000b56ee39 _ZN5tokio7runtime8blocking4pool4Task3run17hc9fe2302ec3cd3c6E () + 19
 000000000b571937 _ZN5tokio7runtime8blocking4pool5Inner3run17h2240e8859f9465a7E () + 977
 000000000b570f22 _ZN5tokio7runtime8blocking4pool7Spawner12spawn_thread28_$u7b$$u7b$closure$u7d$$u7d$17h85351a04c0f5c74bE () + b2
 000000000b5613c0 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17hb5decbed2181c435E () + 30
 000000000b577800 _ZN3std6thread7Builder16spawn_unchecked_28_$u7b$$u7b$closure$u7d$$u7d$28_$u7b$$u7b$closure$u7d$$u7d$17hadb1ee7998241e25E () + 30
 000000000b560f00 _ZN115_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$9call_once17hd07474e5bd04a469E () + 30
 000000000b5f9589 _ZN3std9panicking3try7do_call17h914d96aad3aec732E () + 49
 000000000b5ff23d __rust_try () + 1d
 000000000b5f3124 _ZN3std9panicking3try17h3a148fd488e13234E () + b4
 000000000b4dc060 _ZN3std5panic12catch_unwind17hea95d0ae36f34a12E () + 30
 000000000b5775f4 _ZN3std6thread7Builder16spawn_unchecked_28_$u7b$$u7b$closure$u7d$$u7d$17h0070d6f31ff325b2E () + 154
 000000000b4b9fe1 _ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h976f9b1526746fd3E () + 11
 000000000b788907 _ZN3std3sys4unix6thread6Thread3new12thread_start17h4253fc05178a52beE () + 27
 fffffc7feec00e4c _thrp_setup (fffffc7fed161a40) + 6c
 fffffc7feec01160 _lwp_start ()
------------------- thread# 377 / lwp# 377 -------------------
 fffffc7feec011a7 lwp_park (0, 0, 0)
 fffffc7feebfa65b cond_wait_queue (ded9568, ded9550, 0) + 5b
 fffffc7feebfac9a __cond_wait (ded9568, ded9550) + 8a
 fffffc7feebfad04 cond_wait (ded9568, ded9550) + 24
 fffffc7feebfad3b pthread_cond_wait (ded9568, ded9550) + b
 000000000b6808cc _ZN117_$LT$parking_lot_core..thread_parker..imp..ThreadParker$u20$as$u20$parking_lot_core..thread_parker..ThreadParkerT$GT$4park17hfe9a470429d64e69E () + 12c
 000000000b689921 _ZN16parking_lot_core11parking_lot4park28_$u7b$$u7b$closure$u7d$$u7d$17h048df2b73889a528E () + 291
 000000000b689054 _ZN16parking_lot_core11parking_lot4park17hb97dbf2fc86435c1E () + 144
 000000000b682551 _ZN11parking_lot7condvar7Condvar19wait_until_internal17hd6a408285f2ebeceE () + 101
 000000000b4da75e _ZN11parking_lot7condvar7Condvar4wait17hb9add32139490e04E () + 4e
 000000000b597239 _ZN5tokio4loom3std11parking_lot7Condvar4wait17hca6bee674b17bea0E () + 19
 000000000b4df027 _ZN5tokio4park6thread5Inner4park17h828fe786765f3892E () + 167
 000000000b4dffce _ZN75_$LT$tokio..park..thread..CachedParkThread$u20$as$u20$tokio..park..Park$GT$4park28_$u7b$$u7b$closure$u7d$$u7d$17hc0099e3dd854a8adE () + 1e
 000000000b4dfa89 _ZN5tokio4park6thread16CachedParkThread12with_current28_$u7b$$u7b$closure$u7d$$u7d$17habe5aa6d6271bfb9E () + 19
 000000000b5e5ee3 _ZN3std6thread5local17LocalKey$LT$T$GT$8try_with17hccb8c4717c6ee0f9E () + a3
 000000000b4df928 _ZN5tokio4park6thread16CachedParkThread12with_current17h34a73b9f349ebf33E () + 18
 000000000b4dff61 _ZN75_$LT$tokio..park..thread..CachedParkThread$u20$as$u20$tokio..park..Park$GT$4park17he54a6932887491f4E () + 11
 0000000006faea5f _ZN5tokio4park6thread16CachedParkThread8block_on17h1475d50aa08fff62E () + 2ff
 0000000006eec9a1 _ZN5tokio7runtime5enter5Enter8block_on17h43ea6634ee02b614E () + 71
 0000000006efeb8d _ZN5tokio7runtime11thread_pool10ThreadPool8block_on17h12e60997ec38b8caE () + 7d
 0000000006ff0e1c _ZN5tokio7runtime7Runtime8block_on17h01204a9cb25cea8aE () + 12c
 0000000006f11206 _ZN8httptest6server13ServerBuilder3run28_$u7b$$u7b$closure$u7d$$u7d$17hc09c1a93c296ac4cE () + 166
 0000000006fe6218 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h975b8c64626c7db4E () + 28
 0000000006fcc878 _ZN3std6thread7Builder16spawn_unchecked_28_$u7b$$u7b$closure$u7d$$u7d$28_$u7b$$u7b$closure$u7d$$u7d$17h16839c8ed4b7927eE () + 28
 0000000006f12358 _ZN115_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$9call_once17hcdec44cafec7ef4eE () + 28
 0000000006fe00f1 _ZN3std9panicking3try7do_call17hd603177141a9ae96E () + 41
 0000000006fe466d __rust_try () + 1d
 0000000006fdf815 _ZN3std9panicking3try17he9cf2be265e2c795E () + 85
 0000000006f7adf8 _ZN3std5panic12catch_unwind17h468a881da0f071b2E () + 28
 0000000006fcc671 _ZN3std6thread7Builder16spawn_unchecked_28_$u7b$$u7b$closure$u7d$$u7d$17h0561c931d004734cE () + 151
 0000000006fa2301 _ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17hb0a57e60cbbaf556E () + 11
 000000000b788907 _ZN3std3sys4unix6thread6Thread3new12thread_start17h4253fc05178a52beE () + 27
 fffffc7feec00e4c _thrp_setup (fffffc7fee28f240) + 6c
 fffffc7feec01160 _lwp_start ()
------------------- thread# 449 / lwp# 449 -------------------
 000000000b7888e0 _ZN3std3sys4unix6thread6Thread3new12thread_start17h4253fc05178a52beE(), exit value = 0x0000000000000000
        ** zombie (exited, not detached, not yet joined) **
------- thread# 450 / lwp# 450 [tokio-runtime-worker] --------
 fffffc7feec011a7 lwp_park (0, fffffc7fd93fef50, 0)
 fffffc7feebfa65b cond_wait_queue (10f6cb18, 10f6cb00, fffffc7fd93fef50) + 5b
 fffffc7feebfaaad cond_wait_common (10f6cb18, 10f6cb00, fffffc7fd93fef50) + 1dd
 fffffc7feebfae0a __cond_timedwait (10f6cb18, 10f6cb00, fffffc7fd93ff1b8) + ba
 fffffc7feebfae71 cond_timedwait (10f6cb18, 10f6cb00, fffffc7fd93ff1b8) + 31
 fffffc7feebfaeb0 pthread_cond_timedwait (10f6cb18, 10f6cb00, fffffc7fd93ff1b8) + 10
 000000000b680500 _ZN117_$LT$parking_lot_core..thread_parker..imp..ThreadParker$u20$as$u20$parking_lot_core..thread_parker..ThreadParkerT$GT$10park_until17h7663de2adb54ae60E () + 380
 000000000b68994a _ZN16parking_lot_core11parking_lot4park28_$u7b$$u7b$closure$u7d$$u7d$17h048df2b73889a528E () + 2ba
 000000000b689054 _ZN16parking_lot_core11parking_lot4park17hb97dbf2fc86435c1E () + 144
 000000000b682551 _ZN11parking_lot7condvar7Condvar19wait_until_internal17hd6a408285f2ebeceE () + 101
 000000000b4da7f2 _ZN11parking_lot7condvar7Condvar8wait_for17h035187d565d4b030E () + 82
 000000000b5972db _ZN5tokio4loom3std11parking_lot7Condvar12wait_timeout17h22062eae38617ce8E () + 4b
 000000000b571226 _ZN5tokio7runtime8blocking4pool5Inner3run17h2240e8859f9465a7E () + 266
 000000000b570f22 _ZN5tokio7runtime8blocking4pool7Spawner12spawn_thread28_$u7b$$u7b$closure$u7d$$u7d$17h85351a04c0f5c74bE () + b2
 000000000b5613c0 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17hb5decbed2181c435E () + 30
 000000000b577800 _ZN3std6thread7Builder16spawn_unchecked_28_$u7b$$u7b$closure$u7d$$u7d$28_$u7b$$u7b$closure$u7d$$u7d$17hadb1ee7998241e25E () + 30
 000000000b560f00 _ZN115_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$9call_once17hd07474e5bd04a469E () + 30
 000000000b5f9589 _ZN3std9panicking3try7do_call17h914d96aad3aec732E () + 49
 000000000b5ff23d __rust_try () + 1d
 000000000b5f3124 _ZN3std9panicking3try17h3a148fd488e13234E () + b4
 000000000b4dc060 _ZN3std5panic12catch_unwind17hea95d0ae36f34a12E () + 30
 000000000b5775f4 _ZN3std6thread7Builder16spawn_unchecked_28_$u7b$$u7b$closure$u7d$$u7d$17h0070d6f31ff325b2E () + 154
 000000000b4b9fe1 _ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h976f9b1526746fd3E () + 11
 000000000b788907 _ZN3std3sys4unix6thread6Thread3new12thread_start17h4253fc05178a52beE () + 27
 fffffc7feec00e4c _thrp_setup (fffffc7fee283240) + 6c
 fffffc7feec01160 _lwp_start ()

I didn't look at that very closely, but enough to not see it obviously spinning in a loop or blocked on a specific operation. [narrator: maybe could have looked more closely.]

It looks blocked. Is that right? Let's see what it's doing over time:

dap@ivanova ~ $ prstat -mLc -p 10286 1
Please wait...
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWP   
 10286 dap      0.1 0.1 0.0 0.0 0.0  98 0.0 0.0   2   0   8   0 test_all-d586ea/60
 10286 dap      0.1 0.0 0.0 0.0 0.0 0.0  98 0.0   4   0  16   0 test_all-d586ea/121
 10286 dap      0.0 0.0 0.0 0.0 0.0  98 0.0 0.0   3   0 102   0 test_all-d586ea/122
 10286 dap      0.0 0.0 0.0 0.0 0.0  98 0.1 0.0   2   0   6   0 test_all-d586ea/1
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   9   0 test_all-d586ea/451 [tokio-runtime-worker]
 10286 dap      0.0 0.0 0.0 0.0 0.0 0.0  99 0.0   0   0   1   0 test_all-d586ea/378 [tokio-runtime-worker]
 10286 dap      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0   0   0   0 test_all-d586ea/450
 10286 dap      0.0 0.0 0.0 0.0 0.0  99 0.0 0.0   0   0   0   0 test_all-d586ea/377
Total: 1 processes, 8 lwps, load averages: 0.46, 4.34, 7.11
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWP   
 10286 dap      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0   0   0   0 test_all-d586ea/450
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/451 [tokio-runtime-worker]
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/377
 10286 dap      0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 test_all-d586ea/378 [tokio-runtime-worker]
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/122
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/60
 10286 dap      0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 test_all-d586ea/121
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/1
Total: 1 processes, 8 lwps, load averages: 0.46, 4.32, 7.10
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWP   
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   1   0   1   0 test_all-d586ea/60
 10286 dap      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0   0   0   0 test_all-d586ea/450
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/451 [tokio-runtime-worker]
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/377
 10286 dap      0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 test_all-d586ea/378 [tokio-runtime-worker]
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/122
 10286 dap      0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 test_all-d586ea/121
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/1
Total: 1 processes, 8 lwps, load averages: 0.45, 4.31, 7.09
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWP   
 10286 dap      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0   0   0   0 test_all-d586ea/450
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/451 [tokio-runtime-worker]
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/377
 10286 dap      0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 test_all-d586ea/378 [tokio-runtime-worker]
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/122
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/60
 10286 dap      0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 test_all-d586ea/121
 10286 dap      0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 test_all-d586ea/1
Total: 1 processes, 8 lwps, load averages: 0.45, 4.29, 7.09

Yeah, it's not on-CPU much. I figured at this point I'd save a core file. It's been running for 5 minutes:

dap@ivanova ~ $ ps -opid,etime -p 10286
  PID     ELAPSED
10286       05:10
dap@ivanova ~ $ gcore 10286
gcore: core.10286 dumped

Maybe we can infer something about what it's doing from the syscalls it's making?

$ truss -p 10286
/377:   lwp_park(0x00000000, 0)         (sleeping...)
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330) (sleeping...)
/378:   ioctl(225, DP_POLL, 0xFFFFFC7FD21FE280) (sleeping...)
/122:   lwp_park(0x00000000, 0)         (sleeping...)
/60:    lwp_park(0xFFFFFC7FEDF27E60, 0) (sleeping...)
/457:   lwp_park(0xFFFFFC7FD35FEF50, 0) (sleeping...)
/1:     lwp_park(0x00000000, 0)         (sleeping...)
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 0
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 0
/121:   lwp_unpark(457)                                 = 0
/457:   lwp_park(0xFFFFFC7FD35FEF50, 0)                 = 0
/457:   getrandom("01 ;F2 *07 o P kA5A70782".., 16, 0)  = 16
/457:   getuid()                                        = 501 [501]
/457:   getuid()                                        = 501 [501]
/457:   getuid()                                        = 501 [501]
/457:   door_info(86, 0xFFFFFC7FD35FCB60)               = 0
/457:   door_call(86, 0xFFFFFC7FD35FCBE0)               = 0
/457:   stat("/home/dap/.pgpass", 0xFFFFFC7FD35FD4C0)   Err#2 ENOENT
/457:   so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, 0x00000000, SOV_DEFAULT) = 3
/457:   setsockopt(3, tcp, TCP_NODELAY, 0xFFFFFC7FD35FD288, 4, SOV_DEFAULT) = 0
/457:   fcntl(3, F_GETFL)                               = 2
/457:   fcntl(3, F_SETFL, FWRITE|FNONBLOCK)             = 0
/457:   fcntl(3, F_SETFD, 0x00000001)                   = 0
/457:   setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, 0xFFFFFC7FD35FD284, 4, SOV_DEFAULT) = 0
/457:   connect(3, 0x13C05CF0, 16, SOV_DEFAULT)         Err#146 ECONNREFUSED
/457:   close(3)                                        = 0
/457:   read(82, "01", 4096)                            = 1
/457:   read(82, 0xFFFFFC7FD35FDE90, 4096)              Err#11 EAGAIN
/457:   write(87, "01", 1)                              = 1
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 1
/121:   read(82, "01", 4096)                            = 1
/121:   read(82, 0xFFFFFC7FDF1EEC10, 4096)              Err#11 EAGAIN
/121:   write(87, "01", 1)                              = 1
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 1
/457:   lwp_park(0xFFFFFC7FD35FEF50, 0) (sleeping...)
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330) (sleeping...)
/60:    lwp_park(0xFFFFFC7FEDF27E60, 0)                 Err#62 ETIME
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 0
/60:    lwp_park(0xFFFFFC7FEDF27E60, 0) (sleeping...)
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330) (sleeping...)
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 0
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 0
/121:   lwp_unpark(457)                                 = 0
/457:   lwp_park(0xFFFFFC7FD35FEF50, 0)                 = 0
/457:   getrandom(" e O17 \1AFA < MF516F9DA".., 16, 0)  = 16
/457:   getuid()                                        = 501 [501]
/457:   getuid()                                        = 501 [501]
/457:   getuid()                                        = 501 [501]
/457:   door_info(86, 0xFFFFFC7FD35FCB60)               = 0
/457:   door_call(86, 0xFFFFFC7FD35FCBE0)               = 0
/457:   stat("/home/dap/.pgpass", 0xFFFFFC7FD35FD4C0)   Err#2 ENOENT
/457:   so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, 0x00000000, SOV_DEFAULT) = 3
/457:   setsockopt(3, tcp, TCP_NODELAY, 0xFFFFFC7FD35FD288, 4, SOV_DEFAULT) = 0
/457:   fcntl(3, F_GETFL)                               = 2
/457:   fcntl(3, F_SETFL, FWRITE|FNONBLOCK)             = 0
/457:   fcntl(3, F_SETFD, 0x00000001)                   = 0
/457:   setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, 0xFFFFFC7FD35FD284, 4, SOV_DEFAULT) = 0
/457:   connect(3, 0x0FD5DF90, 16, SOV_DEFAULT)         Err#146 ECONNREFUSED
/457:   close(3)                                        = 0
/457:   read(82, "01", 4096)                            = 1
/457:   read(82, 0xFFFFFC7FD35FDE90, 4096)              Err#11 EAGAIN
/457:   write(87, "01", 1)                              = 1
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 1
/121:   read(82, "01", 4096)                            = 1
/121:   read(82, 0xFFFFFC7FDF1EEC10, 4096)              Err#11 EAGAIN
/121:   write(87, "01", 1)                              = 1
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 1
/457:   lwp_park(0xFFFFFC7FD35FEF50, 0) (sleeping...)
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330) (sleeping...)
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 0
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330) (sleeping...)
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 0
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 0
/121:   lwp_unpark(457)                                 = 0
/457:   lwp_park(0xFFFFFC7FD35FEF50, 0)                 = 0
/457:   getrandom("A8D6C5 R &D48BF297A1F0BB".., 16, 0)  = 16
/457:   getuid()                                        = 501 [501]
/457:   getuid()                                        = 501 [501]
/457:   getuid()                                        = 501 [501]
/457:   door_info(86, 0xFFFFFC7FD35FCB60)               = 0
/457:   door_call(86, 0xFFFFFC7FD35FCBE0)               = 0
/457:   stat("/home/dap/.pgpass", 0xFFFFFC7FD35FD4C0)   Err#2 ENOENT
/457:   so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, 0x00000000, SOV_DEFAULT) = 3
/457:   setsockopt(3, tcp, TCP_NODELAY, 0xFFFFFC7FD35FD288, 4, SOV_DEFAULT) = 0
/457:   fcntl(3, F_GETFL)                               = 2
/457:   fcntl(3, F_SETFL, FWRITE|FNONBLOCK)             = 0
/457:   fcntl(3, F_SETFD, 0x00000001)                   = 0
/457:   setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, 0xFFFFFC7FD35FD284, 4, SOV_DEFAULT) = 0
/457:   connect(3, 0x13C05CF0, 16, SOV_DEFAULT)         Err#146 ECONNREFUSED
/457:   close(3)                                        = 0
/457:   read(82, "01", 4096)                            = 1
/457:   read(82, 0xFFFFFC7FD35FDE90, 4096)              Err#11 EAGAIN
/457:   write(87, "01", 1)                              = 1
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330)          = 1
/457:   lwp_park(0xFFFFFC7FD35FEF50, 0) (sleeping...)
/121:   ioctl(76, DP_POLL, 0xFFFFFC7FDF1F1330) (sleeping...)
/60:    lwp_park(0xFFFFFC7FEDF27E60, 0)                 Err#62 ETIME
/60:    lwp_park(0xFFFFFC7FEDF27E60, 0) (sleeping...)

It seems to be trying to connect to something. At this point I realized we should still have the log file! With luck, it might be the only one we're currently writing to, so it would be the most recently modified one:

$ ls -lrt $TMPDIR
...
-rw-r--r--   1 dap      staff     109731 May 27 11:59 test_all-d586ea57740e3382-test_disk_not_found_before_creation.10286.21.log
$ 

Perfect. How about it:

$ bunyan $TMPDIR/test_all-d586ea57740e3382-test_disk_not_found_before_creation.10286.21.log
[2022-05-27T18:52:49.61546721Z]  INFO: test_disk_not_found_before_creation/10286 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/.tmpLYH1Md
[2022-05-27T18:52:49.615845698Z]  INFO: test_disk_not_found_before_creation/10286 on ivanova: cockroach: copying from seed directory (/home/dap/omicron/target/debug/build/nexus-test-utils-308521ed0d0eed98/out/crdb-base) to storage directory (/dangerzone/omicron_tmp/.tmpLYH1Md/data)
[2022-05-27T18:52:49.638721972Z]  INFO: test_disk_not_found_before_creation/10286 on ivanova: cockroach command line: cockroach start-single-node --insecure --http-addr=:0 --store=path=/dangerzone/omicron_tmp/.tmpLYH1Md/data,ballast-size=0 --listen-addr 127.0.0.1:0 --listening-url-file /dangerzone/omicron_tmp/.tmpLYH1Md/listen-url
[2022-05-27T18:52:50.103556318Z]  INFO: test_disk_not_found_before_creation/10286 on ivanova: cockroach pid: 10622
[2022-05-27T18:52:50.104064581Z]  INFO: test_disk_not_found_before_creation/10286 on ivanova: cockroach listen URL: postgresql://[email protected]:53734/omicron?sslmode=disable
[2022-05-27T18:52:50.67671281Z]  INFO: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/10286 on ivanova: setting up nexus server
...
...
[2022-05-27T18:52:50.835685791Z] DEBUG: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/dropshot_internal/10286 on ivanova: registered endpoint (local_addr=127.0.0.1:34947, path=/zpools/{zpool_id}/dataset/{dataset_id}, method=PUT)
[2022-05-27T18:52:50.836490947Z]  INFO: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/dropshot_external/10286 on ivanova: listening (local_addr=127.0.0.1:59840)
[2022-05-27T18:52:50.836915532Z] DEBUG: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/dropshot_external/10286 on ivanova: successfully registered DTrace USDT probes (local_addr=127.0.0.1:59840)
[2022-05-27T18:52:50.837323425Z]  INFO: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/dropshot_internal/10286 on ivanova: listening (local_addr=127.0.0.1:34947)
[2022-05-27T18:52:50.838038541Z] DEBUG: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/dropshot_internal/10286 on ivanova: successfully registered DTrace USDT probes (local_addr=127.0.0.1:34947)
[2022-05-27T18:52:50.838579025Z]  INFO: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: SEC running (sec_id=4aa6dc47-0947-46ad-9b2f-5d22d24b79cc)
[2022-05-27T18:52:50.839303188Z] TRACE: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: authorize begin (resource=Database, action=Modify)
    actor: Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
[2022-05-27T18:52:50.84011652Z] DEBUG: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: roles (roles="RoleSet { roles: {} }")
[2022-05-27T18:52:50.840771091Z] DEBUG: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: authorize result (result=Ok(()), resource=Database, action=Modify)
    actor: Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
[2022-05-27T18:52:50.841552863Z] DEBUG: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: attempting to create built-in users
[2022-05-27T18:52:50.842223003Z] TRACE: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: authorize begin (resource=Database, action=Query)
    actor: Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
[2022-05-27T18:52:50.84310236Z] DEBUG: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: roles (roles="RoleSet { roles: {} }")
[2022-05-27T18:52:50.8437564Z] DEBUG: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: authorize result (result=Ok(()), resource=Database, action=Query)
    actor: Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
[2022-05-27T18:52:50.844639574Z]  INFO: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: start saga recovery
[2022-05-27T18:52:50.861148068Z] TRACE: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: listing sagas
[2022-05-27T18:53:20.789345835Z]  WARN: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: failed to populate built-in PopulateBuiltinUsers; will retry in 239.675139ms
    error_message: ServiceUnavailable { internal_message: "Timeout accessing connection pool" }
[2022-05-27T18:53:20.790022177Z]  WARN: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: failed to list sagas (will retry after 281.24234ms): Service Unavailable: Timeout accessing connection pool
...
[2022-05-27T18:58:05.394430047Z] DEBUG: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: attempting to create built-in users
[2022-05-27T18:58:05.394781714Z] TRACE: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: authorize begin (resource=Database, action=Query)
    actor: Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
[2022-05-27T18:58:05.395139242Z] DEBUG: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: roles (roles="RoleSet { roles: {} }")
[2022-05-27T18:58:05.395447185Z] DEBUG: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: authorize result (result=Ok(()), resource=Database, action=Query)
    actor: Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
[2022-05-27T18:58:35.396827299Z]  WARN: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: failed to populate built-in PopulateBuiltinUsers; will retry in 81.672887042s
    error_message: ServiceUnavailable { internal_message: "Timeout accessing connection pool" }
[2022-05-27T18:59:22.27520407Z] TRACE: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: listing sagas
[2022-05-27T18:59:52.278448731Z]  WARN: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: failed to list sagas (will retry after 184.972150589s): Service Unavailable: Timeout accessing connection pool

We seem to be having trouble connecting to the database. (And you'd see that in the pstack output if you looked more closely than I did.)

Can we connect to the database now? The URL is in the log above, near the top.

$ cockroach sql --url postgresql://[email protected]:53734/omicron?sslmode=disable
#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.

dial tcp 127.0.0.1:53734: connect: connection refused
Failed running "sql"

Strange. It doesn't seem to be running:

$ pgrep -f cockroach
$

CockroachDB's temporary directory is also in the log output. What's in there?

$ ls /dangerzone/omicron_tmp/.tmpLYH1Md
total 10
-rw-r--r--   1 dap      staff        710 May 27 11:52 cockroachdb_stderr
-rw-r--r--   1 dap      staff        885 May 27 11:52 cockroachdb_stdout
drwxr-xr-x   5 dap      staff         22 May 27 11:52 data/
-rw-r-----   1 dap      staff         60 May 27 11:52 listen-url

Anything good in the output files?

$ cat /dangerzone/omicron_tmp/.tmpLYH1Md/cockroachdb_stderr 
*
* WARNING: ALL SECURITY CONTROLS HAVE BEEN DISABLED!
* 
* This mode is intended for non-production testing only.
* 
* In this mode:
* - Your cluster is open to any client that can access 127.0.0.1.
* - Intruders with access to your machine or network can observe client-server traffic.
* - Intruders can log in without password and read or write any data in the cluster.
* - Intruders can consume all your server's resources and cause unavailability.
*
*
* INFO: To start a secure server without mandating TLS for clients,
* consider --accept-sql-without-tls instead. For other options, see:
* 
* - https://go.crdb.dev/issue-v/53404/v21.2
* - https://www.cockroachlabs.com/docs/v21.2/secure-a-cluster.html
*
$ cat /dangerzone/omicron_tmp/.tmpLYH1Md/cockroachdb_stdout
CockroachDB node starting at 2022-05-27 18:52:50.142511011 +0000 UTC (took 0.1s)
build:               OSS v21.2.9 @ 2022/04/28 04:02:42 (go1.16.10)
webui:               http://127.0.0.1:36270
sql:                 postgresql://[email protected]:53734/defaultdb?sslmode=disable
sql (JDBC):          jdbc:postgresql://127.0.0.1:53734/defaultdb?sslmode=disable&user=root
RPC client flags:    cockroach <client cmd> --host=127.0.0.1:53734 --insecure
logs:                /dangerzone/omicron_tmp/.tmpLYH1Md/data/logs
temp dir:            /dangerzone/omicron_tmp/.tmpLYH1Md/data/cockroach-temp577904075
external I/O path:   /dangerzone/omicron_tmp/.tmpLYH1Md/data/extern
store[0]:            path=/dangerzone/omicron_tmp/.tmpLYH1Md/data
storage engine:      pebble
status:              restarted pre-existing node
clusterID:           07a6884d-c511-4db8-bf1f-90fce7b2abf2
nodeID:              1
$

No obvious sign of a crash. I took a look at the log file with less /dangerzone/omicron_tmp/.tmpLYH1Md/data/logs/cockroach.log and didn't see anything obviously broken. Are we sure it's gone?

$ fuser /dangerzone/omicron_tmp/.tmpLYH1Md/data/logs/cockroach.log 
/dangerzone/omicron_tmp/.tmpLYH1Md/data/logs/cockroach.log: 

Yeah, nothing's even got the log file open and it's last entry was when it started, many minutes ago:

$ tail -n 20 /dangerzone/omicron_tmp/.tmpLYH1Md/data/logs/cockroach.log
I220527 18:52:50.132019 788 server/auto_upgrade.go:55 ⋮ [n1] 51  no need to upgrade, cluster already at the newest version
I220527 18:52:50.137502 570 kv/kvserver/replica_command.go:411 ⋮ [n1,split,s1,r115/1:‹/{Table/107-Max}›] 52  initiating a split of this range at key ‹/Table/108› [r116] (‹span config›)‹›
I220527 18:52:50.142469 23 1@server/server.go:2217 ⋮ [n1] 53  serving sql connections
I220527 18:52:50.142599 23 1@cli/start.go:749 ⋮ [config] 54  clusterID: ‹07a6884d-c511-4db8-bf1f-90fce7b2abf2›
I220527 18:52:50.142612 23 1@cli/start.go:749 ⋮ [config] 55  nodeID: n1
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56  node startup completed:
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +CockroachDB node starting at 2022-05-27 18:52:50.142511011 +0000 UTC (took 0.1s)
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +build:               OSS v21.2.9 @ 2022/04/28 04:02:42 (go1.16.10)
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +webui:               ‹http://127.0.0.1:36270›
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +sql:                 ‹postgresql://[email protected]:53734/defaultdb?sslmode=disable›
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +sql (JDBC):          ‹jdbc:postgresql://127.0.0.1:53734/defaultdb?sslmode=disable&user=root›
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +RPC client flags:    ‹cockroach <client cmd> --host=127.0.0.1:53734 --insecure›
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +logs:                ‹/dangerzone/omicron_tmp/.tmpLYH1Md/data/logs›
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +temp dir:            ‹/dangerzone/omicron_tmp/.tmpLYH1Md/data/cockroach-temp577904075›
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +external I/O path:   ‹/dangerzone/omicron_tmp/.tmpLYH1Md/data/extern›
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +store[0]:            ‹path=/dangerzone/omicron_tmp/.tmpLYH1Md/data›
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +storage engine:      pebble
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +status:              restarted pre-existing node
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +clusterID:           ‹07a6884d-c511-4db8-bf1f-90fce7b2abf2›
I220527 18:52:50.142640 23 1@cli/start.go:759 ⋮ [-] 56 +nodeID:              1
$ TZ=utc ps -ostime -p 10286
   STIME
18:52:45

(more coming -- have to break up this GitHub comment)

@davepacheco
Copy link
Collaborator Author

At this point I took a look at that ptree output again with an eye towards that <defunct> process:

      24786  -bash
        7622   /home/dap/.rustup/toolchains/nightly-2022-04-27-x86_64-unknown-illumos/bin/carg
          10286  /home/dap/omicron/target/debug/deps/test_all-d586ea57740e3382
            10622  <defunct>
            10653  clickhouse server --log-file /dangerzone/omicron_tmp/.tmpPfAddb/clickhouse-serv

What was that? I don't know any way to get information about defunct processes except through the kernel debugger. I believe these are process that have exited but not been waited-on. The kernel throws away a lot of the state, I think. The userland tools don't say much:

$ pargs 10622
10622: <defunct>
$ ps -opid,args -p 10622
  PID COMMAND
10622 <defunct>
$

but the kernel debugger can help!

$ pfexec mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs sata ip hook neti sockfs arp usba xhci sd stmf stmf_sbd mm lofs random crypto cpc ufs logindmux ptm nfs ]
> 0t10622::pid2proc | ::walk thread | ::stacks
> 0t10622::pid2proc | ::ps
S PID    PPID   PGID   SID    UID    FLAGS      ADDR             NAME
Z 10622  10286  7622   24786  501    0x4a014002 fffffe5fb9b36030 cockroach

Aha! That was CockroachDB. What happened to it? I wonder if we can find anything in its proc_t:

> 0t10622::pid2proc | ::print -at proc_t
fffffe5fb9b36030 proc_t {
    fffffe5fb9b36030 struct vnode *p_exec = 0
    fffffe5fb9b36038 struct as *p_as = kas
    fffffe5fb9b36040 struct plock *p_lockp = 0xfffffe5932866540
    fffffe5fb9b36048 kmutex_t p_crlock = {
        fffffe5fb9b36048 void *[1] _opaque = [ 0 ]
    }
    fffffe5fb9b36050 struct cred *p_cred = 0xfffffe5adf758bc0
    fffffe5fb9b36058 int p_swapcnt = 0
    fffffe5fb9b3605c char p_stat = '\003'
    fffffe5fb9b3605d char p_wcode = '\001'
    fffffe5fb9b3605e ushort_t p_pidflag = 0
    fffffe5fb9b36060 int p_wdata = 0x2
...

p_stat, p_wcode, and p_wdata -- those look interesting. Is that related to the exit status codes provided by the wait(2) family? In the illumos source, I looked for references to p_stat and found an assignment with value SZOMB. So I looked for the definition of SZOMB (to find the value of it and other values for this field). Our value is \003, which is SZOMB -- "process terminated but not waited for". Well, we knew that. But at least that matches what I thought.

I did a similar search for p_wcode, found this assignment, looked for CLD_CONTINUED, and found these definitions. Our value is 1, which is CLD_EXITED -- "child has exited". p_what is a little vaguer to me but it appears to be the exit status passed to exit(2), which in this case was 0x2.

At this point I went further into the database directory looking for any recently-written files.

 $ cd /dangerzone/omicron_tmp/.tmpLYH1Md/data/logs
$ ls -lrt
total 272
drwxr-xr-x   2 dap      staff          2 May 27 11:52 goroutine_dump/
drwxr-xr-x   2 dap      staff          3 May 27 11:52 heap_profiler/
-rw-r-----   1 dap      staff      46720 May 27 11:52 cockroach-sql-schema.log
-rw-r-----   1 dap      staff       2485 May 27 11:52 cockroach-health.ivanova.dap.2022-05-27T18_47_55Z.009526.log
drwxr-xr-x   2 dap      staff          2 May 27 11:52 inflight_trace_dump/
-rw-r-----   1 dap      staff     235036 May 27 11:52 cockroach.ivanova.dap.2022-05-27T18_47_53Z.009526.log
-rw-r-----   1 dap      staff      46720 May 27 11:52 cockroach-sql-schema.ivanova.dap.2022-05-27T18_47_54Z.009526.log
-rw-r-----   1 dap      staff       2836 May 27 11:52 cockroach-pebble.ivanova.dap.2022-05-27T18_47_53Z.009526.log
-rw-r-----   1 dap      staff       2393 May 27 11:52 cockroach-security.log
-rw-r-----   1 dap      staff       1147 May 27 11:52 cockroach-stderr.ivanova.dap.2022-05-27T18_47_53Z.009526.log
-rw-r-----   1 dap      staff       2393 May 27 11:52 cockroach-security.ivanova.dap.2022-05-27T18_47_54Z.009526.log
lrwxrwxrwx   1 dap      staff         60 May 27 11:52 cockroach-stderr.log -> cockroach-stderr.ivanova.dap.2022-05-27T18_52_49Z.010622.log
lrwxrwxrwx   1 dap      staff         53 May 27 11:52 cockroach.log -> cockroach.ivanova.dap.2022-05-27T18_52_49Z.010622.log
lrwxrwxrwx   1 dap      staff         60 May 27 11:52 cockroach-pebble.log -> cockroach-pebble.ivanova.dap.2022-05-27T18_52_49Z.010622.log
lrwxrwxrwx   1 dap      staff         60 May 27 11:52 cockroach-health.log -> cockroach-health.ivanova.dap.2022-05-27T18_52_50Z.010622.log
-rw-r-----   1 dap      staff       1141 May 27 11:52 cockroach-health.ivanova.dap.2022-05-27T18_52_50Z.010622.log
-rw-r-----   1 dap      staff       4576 May 27 11:52 cockroach-pebble.ivanova.dap.2022-05-27T18_52_49Z.010622.log
-rw-r-----   1 dap      staff      13191 May 27 11:52 cockroach.ivanova.dap.2022-05-27T18_52_49Z.010622.log
-rw-r-----   1 dap      staff     287626 May 27 11:52 cockroach-stderr.ivanova.dap.2022-05-27T18_52_49Z.010622.log

What's in that most recently written log file?

$ tail cockroach-stderr.ivanova.dap.2022-05-27T18_52_49Z.010622.log

goroutine 853 [select]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc0005fe2d0, 0x1, 0x0, 0x0, 0x0, 0x0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:406 +0xff
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc001f4a780, 0x0, 0x0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:533 +0x1dd
google.golang.org/grpc/internal/transport.newHTTP2Client.func3(0xc0028921e0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/http2_client.go:396 +0x7b
created by google.golang.org/grpc/internal/transport.newHTTP2Client
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/http2_client.go:394 +0x12ae

and I think we have our culprit. I'll have to attach the file separately because it's too big to put in a GitHub comment. But the relevant line is:

I220527 18:52:49.994016 1 util/log/flags.go:201  [-] 1  stderr capture started
runtime: port_getn on fd 4 failed (errno=22)
fatal error: runtime: netpoll failed

What's errno 22?

#define	EINVAL	22	/* Invalid argument			*/

Yikes. That sounds like a bad misuse of event ports but I don't know how to tell more given what's here. Had CockroachDB aborted, we'd have a core file, and we'd know what fd 4 was. We'd also potentially know what it had passed to port_getn.

So what's the summary of problems here?

  • most root cause we found: CockroachDB got EINVAL from port_getn and exited
  • maybe: CockroachDB should abort in this case instead of exit?
  • the test did not notice this, so it's waiting indefinitely to establish a TCP connection to CockroachDB, which will never work. This would be nice to improve but I'm not sure how concretely to do that.

It's conceivable that this problem is related to the 30-second CockroachDB startup timeouts that we've occasionally seen in CI, but I don't think we've seen those on illumos systems and this problem is very likely illumos-specific (since it's an illumos-specific interface that it's invoking and getting EINVAL).

@davepacheco
Copy link
Collaborator Author

Full log: cockroach-stderr-log.gz

@davepacheco
Copy link
Collaborator Author

Full CockroachDB directory, a tarball inside a ZIP file because GitHub doesn't support .tgz:
cockroachdb-directory.tgz.zip

@davepacheco
Copy link
Collaborator Author

I used the command-line that we logged in the file to start CockroachDB again by hand:

$ cockroach start-single-node --insecure --http-addr=:0 --store=path=/dangerzone/omicron_tmp/.tmpLYH1Md/data,ballast-size=0 --listen-addr 127.0.0.1:0 --listening-url-file /dangerzone/omicron_tmp/.tmpLYH1Md/listen-url
*
* WARNING: ALL SECURITY CONTROLS HAVE BEEN DISABLED!
* 
* This mode is intended for non-production testing only.
* 
* In this mode:
* - Your cluster is open to any client that can access 127.0.0.1.
* - Intruders with access to your machine or network can observe client-server traffic.
* - Intruders can log in without password and read or write any data in the cluster.
* - Intruders can consume all your server's resources and cause unavailability.
*
*
* INFO: To start a secure server without mandating TLS for clients,
* consider --accept-sql-without-tls instead. For other options, see:
* 
* - https://go.crdb.dev/issue-v/53404/v21.2
* - https://www.cockroachlabs.com/docs/v21.2/secure-a-cluster.html
*
CockroachDB node starting at 2022-05-27 19:53:34.029190437 +0000 UTC (took 0.1s)
build:               OSS v21.2.9 @ 2022/04/28 04:02:42 (go1.16.10)
webui:               http://127.0.0.1:49618
sql:                 postgresql://[email protected]:47460/defaultdb?sslmode=disable
sql (JDBC):          jdbc:postgresql://127.0.0.1:47460/defaultdb?sslmode=disable&user=root
RPC client flags:    cockroach <client cmd> --host=127.0.0.1:47460 --insecure
logs:                /dangerzone/omicron_tmp/.tmpLYH1Md/data/logs
temp dir:            /dangerzone/omicron_tmp/.tmpLYH1Md/data/cockroach-temp464243076
external I/O path:   /dangerzone/omicron_tmp/.tmpLYH1Md/data/extern
store[0]:            path=/dangerzone/omicron_tmp/.tmpLYH1Md/data
storage engine:      pebble
status:              restarted pre-existing node
clusterID:           07a6884d-c511-4db8-bf1f-90fce7b2abf2
nodeID:              1

then saved a "debug zip" so we can report this to CockroachDB:

$ cockroach debug zip --url postgresql://[email protected]:47460/defaultdb?sslmode=disable --insecure cockroachdb-zip.zip
warning: --url specifies user/password, but command "zip" does not accept user/password details - details ignored
warning: --url specifies database "defaultdb", but command "zip" does not accept a database name - database name ignored
[cluster] establishing RPC connection to 127.0.0.1:47460...
[cluster] establishing RPC connection to 127.0.0.1:47460: done
[cluster] retrieving the node status to get the SQL address...
[cluster] retrieving the node status to get the SQL address: done
[cluster] using SQL address: 127.0.0.1:47460...
[cluster] using SQL address: 127.0.0.1:47460: using SQL connection URL: postgresql://[email protected]:47460/system?application_name=%24+cockroach+zip&connect_timeout=15&sslmode=disable...
[cluster] using SQL address: 127.0.0.1:47460: done
[cluster] creating output file cockroachdb-zip.zip...
[cluster] creating output file cockroachdb-zip.zip: done
[cluster] requesting data for debug/events...
[cluster] requesting data for debug/events: received response...
[cluster] requesting data for debug/events: converting to JSON...
[cluster] requesting data for debug/events: writing binary output: debug/events.json...
[cluster] requesting data for debug/events: done
[cluster] requesting data for debug/rangelog...
[cluster] requesting data for debug/rangelog: received response...
[cluster] requesting data for debug/rangelog: converting to JSON...
[cluster] requesting data for debug/rangelog: writing binary output: debug/rangelog.json...
[cluster] requesting data for debug/rangelog: done
[cluster] requesting data for debug/settings...
[cluster] requesting data for debug/settings: received response...
[cluster] requesting data for debug/settings: converting to JSON...
[cluster] requesting data for debug/settings: writing binary output: debug/settings.json...
[cluster] requesting data for debug/settings: done
[cluster] requesting data for debug/reports/problemranges...
[cluster] requesting data for debug/reports/problemranges: received response...
[cluster] requesting data for debug/reports/problemranges: converting to JSON...
[cluster] requesting data for debug/reports/problemranges: writing binary output: debug/reports/problemranges.json...
[cluster] requesting data for debug/reports/problemranges: done
[cluster] retrieving SQL data for crdb_internal.cluster_contention_events...
[cluster] retrieving SQL data for crdb_internal.cluster_contention_events: writing output: debug/crdb_internal.cluster_contention_events.txt...
[cluster] retrieving SQL data for crdb_internal.cluster_contention_events: done
[cluster] retrieving SQL data for crdb_internal.cluster_distsql_flows...
[cluster] retrieving SQL data for crdb_internal.cluster_distsql_flows: writing output: debug/crdb_internal.cluster_distsql_flows.txt...
[cluster] retrieving SQL data for crdb_internal.cluster_distsql_flows: done
[cluster] retrieving SQL data for crdb_internal.cluster_database_privileges...
[cluster] retrieving SQL data for crdb_internal.cluster_database_privileges: writing output: debug/crdb_internal.cluster_database_privileges.txt...
[cluster] retrieving SQL data for crdb_internal.cluster_database_privileges: done
[cluster] retrieving SQL data for crdb_internal.cluster_queries...
[cluster] retrieving SQL data for crdb_internal.cluster_queries: writing output: debug/crdb_internal.cluster_queries.txt...
[cluster] retrieving SQL data for crdb_internal.cluster_queries: done
[cluster] retrieving SQL data for crdb_internal.cluster_sessions...
[cluster] retrieving SQL data for crdb_internal.cluster_sessions: writing output: debug/crdb_internal.cluster_sessions.txt...
[cluster] retrieving SQL data for crdb_internal.cluster_sessions: done
[cluster] retrieving SQL data for crdb_internal.cluster_settings...
[cluster] retrieving SQL data for crdb_internal.cluster_settings: writing output: debug/crdb_internal.cluster_settings.txt...
[cluster] retrieving SQL data for crdb_internal.cluster_settings: done
[cluster] retrieving SQL data for crdb_internal.cluster_transactions...
[cluster] retrieving SQL data for crdb_internal.cluster_transactions: writing output: debug/crdb_internal.cluster_transactions.txt...
[cluster] retrieving SQL data for crdb_internal.cluster_transactions: done
[cluster] retrieving SQL data for crdb_internal.default_privileges...
[cluster] retrieving SQL data for crdb_internal.default_privileges: writing output: debug/crdb_internal.default_privileges.txt...
[cluster] retrieving SQL data for crdb_internal.default_privileges: done
[cluster] retrieving SQL data for crdb_internal.jobs...
[cluster] retrieving SQL data for crdb_internal.jobs: writing output: debug/crdb_internal.jobs.txt...
[cluster] retrieving SQL data for crdb_internal.jobs: done
[cluster] retrieving SQL data for system.jobs...
[cluster] retrieving SQL data for system.jobs: writing output: debug/system.jobs.txt...
[cluster] retrieving SQL data for system.jobs: done
[cluster] retrieving SQL data for system.descriptor...
[cluster] retrieving SQL data for system.descriptor: writing output: debug/system.descriptor.txt...
[cluster] retrieving SQL data for system.descriptor: done
[cluster] retrieving SQL data for system.namespace...
[cluster] retrieving SQL data for system.namespace: writing output: debug/system.namespace.txt...
[cluster] retrieving SQL data for system.namespace: done
[cluster] retrieving SQL data for system.scheduled_jobs...
[cluster] retrieving SQL data for system.scheduled_jobs: writing output: debug/system.scheduled_jobs.txt...
[cluster] retrieving SQL data for system.scheduled_jobs: done
[cluster] retrieving SQL data for system.settings...
[cluster] retrieving SQL data for system.settings: writing output: debug/system.settings.txt...
[cluster] retrieving SQL data for system.settings: done
[cluster] retrieving SQL data for system.replication_stats...
[cluster] retrieving SQL data for system.replication_stats: writing output: debug/system.replication_stats.txt...
[cluster] retrieving SQL data for system.replication_stats: done
[cluster] retrieving SQL data for system.replication_critical_localities...
[cluster] retrieving SQL data for system.replication_critical_localities: writing output: debug/system.replication_critical_localities.txt...
[cluster] retrieving SQL data for system.replication_critical_localities: done
[cluster] retrieving SQL data for system.replication_constraint_stats...
[cluster] retrieving SQL data for system.replication_constraint_stats: writing output: debug/system.replication_constraint_stats.txt...
[cluster] retrieving SQL data for system.replication_constraint_stats: done
[cluster] retrieving SQL data for "".crdb_internal.create_statements...
[cluster] retrieving SQL data for "".crdb_internal.create_statements: writing output: debug/crdb_internal.create_statements.txt...
[cluster] retrieving SQL data for "".crdb_internal.create_statements: done
[cluster] retrieving SQL data for "".crdb_internal.create_type_statements...
[cluster] retrieving SQL data for "".crdb_internal.create_type_statements: writing output: debug/crdb_internal.create_type_statements.txt...
[cluster] retrieving SQL data for "".crdb_internal.create_type_statements: done
[cluster] retrieving SQL data for crdb_internal.kv_node_liveness...
[cluster] retrieving SQL data for crdb_internal.kv_node_liveness: writing output: debug/crdb_internal.kv_node_liveness.txt...
[cluster] retrieving SQL data for crdb_internal.kv_node_liveness: done
[cluster] retrieving SQL data for crdb_internal.kv_node_status...
[cluster] retrieving SQL data for crdb_internal.kv_node_status: writing output: debug/crdb_internal.kv_node_status.txt...
[cluster] retrieving SQL data for crdb_internal.kv_node_status: done
[cluster] retrieving SQL data for crdb_internal.kv_store_status...
[cluster] retrieving SQL data for crdb_internal.kv_store_status: writing output: debug/crdb_internal.kv_store_status.txt...
[cluster] retrieving SQL data for crdb_internal.kv_store_status: done
[cluster] retrieving SQL data for crdb_internal.regions...
[cluster] retrieving SQL data for crdb_internal.regions: writing output: debug/crdb_internal.regions.txt...
[cluster] retrieving SQL data for crdb_internal.regions: done
[cluster] retrieving SQL data for crdb_internal.schema_changes...
[cluster] retrieving SQL data for crdb_internal.schema_changes: writing output: debug/crdb_internal.schema_changes.txt...
[cluster] retrieving SQL data for crdb_internal.schema_changes: done
[cluster] retrieving SQL data for crdb_internal.partitions...
[cluster] retrieving SQL data for crdb_internal.partitions: writing output: debug/crdb_internal.partitions.txt...
[cluster] retrieving SQL data for crdb_internal.partitions: done
[cluster] retrieving SQL data for crdb_internal.zones...
[cluster] retrieving SQL data for crdb_internal.zones: writing output: debug/crdb_internal.zones.txt...
[cluster] retrieving SQL data for crdb_internal.zones: done
[cluster] retrieving SQL data for crdb_internal.invalid_objects...
[cluster] retrieving SQL data for crdb_internal.invalid_objects: writing output: debug/crdb_internal.invalid_objects.txt...
[cluster] retrieving SQL data for crdb_internal.invalid_objects: done
[cluster] retrieving SQL data for crdb_internal.index_usage_statistics...
[cluster] retrieving SQL data for crdb_internal.index_usage_statistics: writing output: debug/crdb_internal.index_usage_statistics.txt...
[cluster] retrieving SQL data for crdb_internal.index_usage_statistics: done
[cluster] requesting nodes...
[cluster] requesting nodes: received response...
[cluster] requesting nodes: converting to JSON...
[cluster] requesting nodes: writing binary output: debug/nodes.json...
[cluster] requesting nodes: done
[cluster] requesting liveness...
[cluster] requesting liveness: received response...
[cluster] requesting liveness: converting to JSON...
[cluster] requesting liveness: writing binary output: debug/liveness.json...
[cluster] requesting liveness: done
[cluster] requesting CPU profiles
[cluster] profiles generated
[cluster] profile for node 1...
[cluster] profile for node 1: writing binary output: debug/nodes/1/cpu.pprof...
[cluster] profile for node 1: done
[node 1] node status...
[node 1] node status: converting to JSON...
[node 1] node status: writing binary output: debug/nodes/1/status.json...
[node 1] node status: done
[node 1] using SQL connection URL: postgresql://[email protected]:47460/system?application_name=%24+cockroach+zip&connect_timeout=15&sslmode=disable
[node 1] retrieving SQL data for crdb_internal.feature_usage...
[node 1] retrieving SQL data for crdb_internal.feature_usage: writing output: debug/nodes/1/crdb_internal.feature_usage.txt...
[node 1] retrieving SQL data for crdb_internal.feature_usage: done
[node 1] retrieving SQL data for crdb_internal.gossip_alerts...
[node 1] retrieving SQL data for crdb_internal.gossip_alerts: writing output: debug/nodes/1/crdb_internal.gossip_alerts.txt...
[node 1] retrieving SQL data for crdb_internal.gossip_alerts: done
[node 1] retrieving SQL data for crdb_internal.gossip_liveness...
[node 1] retrieving SQL data for crdb_internal.gossip_liveness: writing output: debug/nodes/1/crdb_internal.gossip_liveness.txt...
[node 1] retrieving SQL data for crdb_internal.gossip_liveness: done
[node 1] retrieving SQL data for crdb_internal.gossip_network...
[node 1] retrieving SQL data for crdb_internal.gossip_network: writing output: debug/nodes/1/crdb_internal.gossip_network.txt...
[node 1] retrieving SQL data for crdb_internal.gossip_network: done
[node 1] retrieving SQL data for crdb_internal.gossip_nodes...
[node 1] retrieving SQL data for crdb_internal.gossip_nodes: writing output: debug/nodes/1/crdb_internal.gossip_nodes.txt...
[node 1] retrieving SQL data for crdb_internal.gossip_nodes: done
[node 1] retrieving SQL data for crdb_internal.leases...
[node 1] retrieving SQL data for crdb_internal.leases: writing output: debug/nodes/1/crdb_internal.leases.txt...
[node 1] retrieving SQL data for crdb_internal.leases: done
[node 1] retrieving SQL data for crdb_internal.node_build_info...
[node 1] retrieving SQL data for crdb_internal.node_build_info: writing output: debug/nodes/1/crdb_internal.node_build_info.txt...
[node 1] retrieving SQL data for crdb_internal.node_build_info: done
[node 1] retrieving SQL data for crdb_internal.node_contention_events...
[node 1] retrieving SQL data for crdb_internal.node_contention_events: writing output: debug/nodes/1/crdb_internal.node_contention_events.txt...
[node 1] retrieving SQL data for crdb_internal.node_contention_events: done
[node 1] retrieving SQL data for crdb_internal.node_distsql_flows...
[node 1] retrieving SQL data for crdb_internal.node_distsql_flows: writing output: debug/nodes/1/crdb_internal.node_distsql_flows.txt...
[node 1] retrieving SQL data for crdb_internal.node_distsql_flows: done
[node 1] retrieving SQL data for crdb_internal.node_inflight_trace_spans...
[node 1] retrieving SQL data for crdb_internal.node_inflight_trace_spans: writing output: debug/nodes/1/crdb_internal.node_inflight_trace_spans.txt...
[node 1] retrieving SQL data for crdb_internal.node_inflight_trace_spans: done
[node 1] retrieving SQL data for crdb_internal.node_metrics...
[node 1] retrieving SQL data for crdb_internal.node_metrics: writing output: debug/nodes/1/crdb_internal.node_metrics.txt...
[node 1] retrieving SQL data for crdb_internal.node_metrics: done
[node 1] retrieving SQL data for crdb_internal.node_queries...
[node 1] retrieving SQL data for crdb_internal.node_queries: writing output: debug/nodes/1/crdb_internal.node_queries.txt...
[node 1] retrieving SQL data for crdb_internal.node_queries: done
[node 1] retrieving SQL data for crdb_internal.node_runtime_info...
[node 1] retrieving SQL data for crdb_internal.node_runtime_info: writing output: debug/nodes/1/crdb_internal.node_runtime_info.txt...
[node 1] retrieving SQL data for crdb_internal.node_runtime_info: done
[node 1] retrieving SQL data for crdb_internal.node_sessions...
[node 1] retrieving SQL data for crdb_internal.node_sessions: writing output: debug/nodes/1/crdb_internal.node_sessions.txt...
[node 1] retrieving SQL data for crdb_internal.node_sessions: done
[node 1] retrieving SQL data for crdb_internal.node_statement_statistics...
[node 1] retrieving SQL data for crdb_internal.node_statement_statistics: writing output: debug/nodes/1/crdb_internal.node_statement_statistics.txt...
[node 1] retrieving SQL data for crdb_internal.node_statement_statistics: done
[node 1] retrieving SQL data for crdb_internal.node_transaction_statistics...
[node 1] retrieving SQL data for crdb_internal.node_transaction_statistics: writing output: debug/nodes/1/crdb_internal.node_transaction_statistics.txt...
[node 1] retrieving SQL data for crdb_internal.node_transaction_statistics: done
[node 1] retrieving SQL data for crdb_internal.node_transactions...
[node 1] retrieving SQL data for crdb_internal.node_transactions: writing output: debug/nodes/1/crdb_internal.node_transactions.txt...
[node 1] retrieving SQL data for crdb_internal.node_transactions: done
[node 1] retrieving SQL data for crdb_internal.node_txn_stats...
[node 1] retrieving SQL data for crdb_internal.node_txn_stats: writing output: debug/nodes/1/crdb_internal.node_txn_stats.txt...
[node 1] retrieving SQL data for crdb_internal.node_txn_stats: done
[node 1] retrieving SQL data for crdb_internal.active_range_feeds...
[node 1] retrieving SQL data for crdb_internal.active_range_feeds: writing output: debug/nodes/1/crdb_internal.active_range_feeds.txt...
[node 1] retrieving SQL data for crdb_internal.active_range_feeds: done
[node 1] requesting data for debug/nodes/1/details...
[node 1] requesting data for debug/nodes/1/details: received response...
[node 1] requesting data for debug/nodes/1/details: converting to JSON...
[node 1] requesting data for debug/nodes/1/details: writing binary output: debug/nodes/1/details.json...
[node 1] requesting data for debug/nodes/1/details: done
[node 1] requesting data for debug/nodes/1/gossip...
[node 1] requesting data for debug/nodes/1/gossip: received response...
[node 1] requesting data for debug/nodes/1/gossip: converting to JSON...
[node 1] requesting data for debug/nodes/1/gossip: writing binary output: debug/nodes/1/gossip.json...
[node 1] requesting data for debug/nodes/1/gossip: done
[node 1] requesting data for debug/nodes/1/enginestats...
[node 1] requesting data for debug/nodes/1/enginestats: received response...
[node 1] requesting data for debug/nodes/1/enginestats: converting to JSON...
[node 1] requesting data for debug/nodes/1/enginestats: writing binary output: debug/nodes/1/enginestats.json...
[node 1] requesting data for debug/nodes/1/enginestats: done
[node 1] requesting stacks...
[node 1] requesting stacks: received response...
[node 1] requesting stacks: writing binary output: debug/nodes/1/stacks.txt...
[node 1] requesting stacks: done
[node 1] requesting heap profile...
[node 1] requesting heap profile: received response...
[node 1] requesting heap profile: writing binary output: debug/nodes/1/heap.pprof...
[node 1] requesting heap profile: done
[node 1] requesting heap file list...
[node 1] requesting heap file list: received response...
[node 1] requesting heap file list: done
[node 1] 4 heap profiles found
[node 1] retrieving memprof.2022-05-27T18_48_04.034.44498864.pprof...
[node 1] retrieving memprof.2022-05-27T18_48_04.034.44498864.pprof: received response...
[node 1] retrieving memprof.2022-05-27T18_48_04.034.44498864.pprof: done
[node 1] writing profile...
[node 1] writing profile: writing binary output: debug/nodes/1/heapprof/memprof.2022-05-27T18_48_04.034.44498864.pprof...
[node 1] writing profile: done
[node 1] retrieving memprof.2022-05-27T19_53_44.005.41259256.pprof...
[node 1] retrieving memprof.2022-05-27T19_53_44.005.41259256.pprof: received response...
[node 1] retrieving memprof.2022-05-27T19_53_44.005.41259256.pprof: done
[node 1] writing profile...
[node 1] writing profile: writing binary output: debug/nodes/1/heapprof/memprof.2022-05-27T19_53_44.005.41259256.pprof...
[node 1] writing profile: done
[node 1] retrieving memprof.2022-05-27T19_54_04.005.48303432.pprof...
[node 1] retrieving memprof.2022-05-27T19_54_04.005.48303432.pprof: received response...
[node 1] retrieving memprof.2022-05-27T19_54_04.005.48303432.pprof: done
[node 1] writing profile...
[node 1] writing profile: writing binary output: debug/nodes/1/heapprof/memprof.2022-05-27T19_54_04.005.48303432.pprof...
[node 1] writing profile: done
[node 1] retrieving memprof.2022-05-27T19_54_24.004.51794584.pprof...
[node 1] retrieving memprof.2022-05-27T19_54_24.004.51794584.pprof: received response...
[node 1] retrieving memprof.2022-05-27T19_54_24.004.51794584.pprof: done
[node 1] writing profile...
[node 1] writing profile: writing binary output: debug/nodes/1/heapprof/memprof.2022-05-27T19_54_24.004.51794584.pprof...
[node 1] writing profile: done
[node 1] requesting goroutine dump list...
[node 1] requesting goroutine dump list: received response...
[node 1] requesting goroutine dump list: done
[node 1] 0 goroutine dumps found
[node 1] requesting log files list...
[node 1] requesting log files list: received response...
[node 1] requesting log files list: done
[node 1] 15 log files found
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T18_47_53Z.009526.log] requesting file...
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T18_47_53Z.009526.log] requesting file: received response...
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T18_47_53Z.009526.log] requesting file: writing output: debug/nodes/1/logs/cockroach-stderr.ivanova.dap.2022-05-27T18_47_53Z.009526.log...
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T18_47_53Z.009526.log] requesting file: done
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file...
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file: received response...
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file: last request failed: rpc error: code = Unknown desc = decoding on line 8: malformed log entry
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file: creating error output: debug/nodes/1/logs/cockroach-stderr.ivanova.dap.2022-05-27T18_52_49Z.010622.log.err.txt...
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file: done
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T19_53_33Z.011708.log] requesting file...
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T19_53_33Z.011708.log] requesting file: received response...
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T19_53_33Z.011708.log] requesting file: writing output: debug/nodes/1/logs/cockroach-stderr.ivanova.dap.2022-05-27T19_53_33Z.011708.log...
[node 1] [log file: cockroach-stderr.ivanova.dap.2022-05-27T19_53_33Z.011708.log] requesting file: done
[node 1] [log file: cockroach-security.ivanova.dap.2022-05-27T18_47_54Z.009526.log] requesting file...
[node 1] [log file: cockroach-security.ivanova.dap.2022-05-27T18_47_54Z.009526.log] requesting file: received response...
[node 1] [log file: cockroach-security.ivanova.dap.2022-05-27T18_47_54Z.009526.log] requesting file: writing output: debug/nodes/1/logs/cockroach-security.ivanova.dap.2022-05-27T18_47_54Z.009526.log...
[node 1] [log file: cockroach-security.ivanova.dap.2022-05-27T18_47_54Z.009526.log] requesting file: done
[node 1] [log file: cockroach-health.ivanova.dap.2022-05-27T18_47_55Z.009526.log] requesting file...
[node 1] [log file: cockroach-health.ivanova.dap.2022-05-27T18_47_55Z.009526.log] requesting file: received response...
[node 1] [log file: cockroach-health.ivanova.dap.2022-05-27T18_47_55Z.009526.log] requesting file: writing output: debug/nodes/1/logs/cockroach-health.ivanova.dap.2022-05-27T18_47_55Z.009526.log...
[node 1] [log file: cockroach-health.ivanova.dap.2022-05-27T18_47_55Z.009526.log] requesting file: done
[node 1] [log file: cockroach-health.ivanova.dap.2022-05-27T18_52_50Z.010622.log] requesting file...
[node 1] [log file: cockroach-health.ivanova.dap.2022-05-27T18_52_50Z.010622.log] requesting file: received response...
[node 1] [log file: cockroach-health.ivanova.dap.2022-05-27T18_52_50Z.010622.log] requesting file: writing output: debug/nodes/1/logs/cockroach-health.ivanova.dap.2022-05-27T18_52_50Z.010622.log...
[node 1] [log file: cockroach-health.ivanova.dap.2022-05-27T18_52_50Z.010622.log] requesting file: done
[node 1] [log file: cockroach-health.ivanova.dap.2022-05-27T19_53_34Z.011708.log] requesting file...
[node 1] [log file: cockroach-health.ivanova.dap.2022-05-27T19_53_34Z.011708.log] requesting file: received response...
[node 1] [log file: cockroach-health.ivanova.dap.2022-05-27T19_53_34Z.011708.log] requesting file: writing output: debug/nodes/1/logs/cockroach-health.ivanova.dap.2022-05-27T19_53_34Z.011708.log...
[node 1] [log file: cockroach-health.ivanova.dap.2022-05-27T19_53_34Z.011708.log] requesting file: done
[node 1] [log file: cockroach-pebble.ivanova.dap.2022-05-27T18_47_53Z.009526.log] requesting file...
[node 1] [log file: cockroach-pebble.ivanova.dap.2022-05-27T18_47_53Z.009526.log] requesting file: received response...
[node 1] [log file: cockroach-pebble.ivanova.dap.2022-05-27T18_47_53Z.009526.log] requesting file: writing output: debug/nodes/1/logs/cockroach-pebble.ivanova.dap.2022-05-27T18_47_53Z.009526.log...
[node 1] [log file: cockroach-pebble.ivanova.dap.2022-05-27T18_47_53Z.009526.log] requesting file: done
[node 1] [log file: cockroach-pebble.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file...
[node 1] [log file: cockroach-pebble.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file: received response...
[node 1] [log file: cockroach-pebble.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file: writing output: debug/nodes/1/logs/cockroach-pebble.ivanova.dap.2022-05-27T18_52_49Z.010622.log...
[node 1] [log file: cockroach-pebble.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file: done
[node 1] [log file: cockroach-pebble.ivanova.dap.2022-05-27T19_53_33Z.011708.log] requesting file...
[node 1] [log file: cockroach-pebble.ivanova.dap.2022-05-27T19_53_33Z.011708.log] requesting file: received response...
[node 1] [log file: cockroach-pebble.ivanova.dap.2022-05-27T19_53_33Z.011708.log] requesting file: writing output: debug/nodes/1/logs/cockroach-pebble.ivanova.dap.2022-05-27T19_53_33Z.011708.log...
[node 1] [log file: cockroach-pebble.ivanova.dap.2022-05-27T19_53_33Z.011708.log] requesting file: done
[node 1] [log file: cockroach-sql-schema.ivanova.dap.2022-05-27T18_47_54Z.009526.log] requesting file...
[node 1] [log file: cockroach-sql-schema.ivanova.dap.2022-05-27T18_47_54Z.009526.log] requesting file: received response...
[node 1] [log file: cockroach-sql-schema.ivanova.dap.2022-05-27T18_47_54Z.009526.log] requesting file: writing output: debug/nodes/1/logs/cockroach-sql-schema.ivanova.dap.2022-05-27T18_47_54Z.009526.log...
[node 1] [log file: cockroach-sql-schema.ivanova.dap.2022-05-27T18_47_54Z.009526.log] requesting file: done
[node 1] [log file: cockroach-telemetry.ivanova.dap.2022-05-27T19_53_34Z.011708.log] requesting file...
[node 1] [log file: cockroach-telemetry.ivanova.dap.2022-05-27T19_53_34Z.011708.log] requesting file: received response...
[node 1] [log file: cockroach-telemetry.ivanova.dap.2022-05-27T19_53_34Z.011708.log] requesting file: writing output: debug/nodes/1/logs/cockroach-telemetry.ivanova.dap.2022-05-27T19_53_34Z.011708.log...
[node 1] [log file: cockroach-telemetry.ivanova.dap.2022-05-27T19_53_34Z.011708.log] requesting file: done
[node 1] [log file: cockroach.ivanova.dap.2022-05-27T18_47_53Z.009526.log] requesting file...
[node 1] [log file: cockroach.ivanova.dap.2022-05-27T18_47_53Z.009526.log] requesting file: received response...
[node 1] [log file: cockroach.ivanova.dap.2022-05-27T18_47_53Z.009526.log] requesting file: writing output: debug/nodes/1/logs/cockroach.ivanova.dap.2022-05-27T18_47_53Z.009526.log...
[node 1] [log file: cockroach.ivanova.dap.2022-05-27T18_47_53Z.009526.log] requesting file: done
[node 1] [log file: cockroach.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file...
[node 1] [log file: cockroach.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file: received response...
[node 1] [log file: cockroach.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file: writing output: debug/nodes/1/logs/cockroach.ivanova.dap.2022-05-27T18_52_49Z.010622.log...
[node 1] [log file: cockroach.ivanova.dap.2022-05-27T18_52_49Z.010622.log] requesting file: done
[node 1] [log file: cockroach.ivanova.dap.2022-05-27T19_53_33Z.011708.log] requesting file...
[node 1] [log file: cockroach.ivanova.dap.2022-05-27T19_53_33Z.011708.log] requesting file: received response...
[node 1] [log file: cockroach.ivanova.dap.2022-05-27T19_53_33Z.011708.log] requesting file: writing output: debug/nodes/1/logs/cockroach.ivanova.dap.2022-05-27T19_53_33Z.011708.log...
[node 1] [log file: cockroach.ivanova.dap.2022-05-27T19_53_33Z.011708.log] requesting file: done
[node 1] requesting ranges...
[node 1] requesting ranges: received response...
[node 1] requesting ranges: done
[node 1] 102 ranges found
[node 1] writing range 1...
[node 1] writing range 1: converting to JSON...
[node 1] writing range 1: writing binary output: debug/nodes/1/ranges/1.json...
[node 1] writing range 1: done
[node 1] writing range 2...
[node 1] writing range 2: converting to JSON...
[node 1] writing range 2: writing binary output: debug/nodes/1/ranges/2.json...
[node 1] writing range 2: done
[node 1] writing range 3...
[node 1] writing range 3: converting to JSON...
[node 1] writing range 3: writing binary output: debug/nodes/1/ranges/3.json...
[node 1] writing range 3: done
[node 1] writing range 4...
[node 1] writing range 4: converting to JSON...
[node 1] writing range 4: writing binary output: debug/nodes/1/ranges/4.json...
[node 1] writing range 4: done
[node 1] writing range 5...
[node 1] writing range 5: converting to JSON...
[node 1] writing range 5: writing binary output: debug/nodes/1/ranges/5.json...
[node 1] writing range 5: done
[node 1] writing range 6...
[node 1] writing range 6: converting to JSON...
[node 1] writing range 6: writing binary output: debug/nodes/1/ranges/6.json...
[node 1] writing range 6: done
[node 1] writing range 7...
[node 1] writing range 7: converting to JSON...
[node 1] writing range 7: writing binary output: debug/nodes/1/ranges/7.json...
[node 1] writing range 7: done
[node 1] writing range 8...
[node 1] writing range 8: converting to JSON...
[node 1] writing range 8: writing binary output: debug/nodes/1/ranges/8.json...
[node 1] writing range 8: done
[node 1] writing range 9...
[node 1] writing range 9: converting to JSON...
[node 1] writing range 9: writing binary output: debug/nodes/1/ranges/9.json...
[node 1] writing range 9: done
[node 1] writing range 10...
[node 1] writing range 10: converting to JSON...
[node 1] writing range 10: writing binary output: debug/nodes/1/ranges/10.json...
[node 1] writing range 10: done
[node 1] writing range 11...
[node 1] writing range 11: converting to JSON...
[node 1] writing range 11: writing binary output: debug/nodes/1/ranges/11.json...
[node 1] writing range 11: done
[node 1] writing range 12...
[node 1] writing range 12: converting to JSON...
[node 1] writing range 12: writing binary output: debug/nodes/1/ranges/12.json...
[node 1] writing range 12: done
[node 1] writing range 13...
[node 1] writing range 13: converting to JSON...
[node 1] writing range 13: writing binary output: debug/nodes/1/ranges/13.json...
[node 1] writing range 13: done
[node 1] writing range 14...
[node 1] writing range 14: converting to JSON...
[node 1] writing range 14: writing binary output: debug/nodes/1/ranges/14.json...
[node 1] writing range 14: done
[node 1] writing range 15...
[node 1] writing range 15: converting to JSON...
[node 1] writing range 15: writing binary output: debug/nodes/1/ranges/15.json...
[node 1] writing range 15: done
[node 1] writing range 16...
[node 1] writing range 16: converting to JSON...
[node 1] writing range 16: writing binary output: debug/nodes/1/ranges/16.json...
[node 1] writing range 16: done
[node 1] writing range 17...
[node 1] writing range 17: converting to JSON...
[node 1] writing range 17: writing binary output: debug/nodes/1/ranges/17.json...
[node 1] writing range 17: done
[node 1] writing range 18...
[node 1] writing range 18: converting to JSON...
[node 1] writing range 18: writing binary output: debug/nodes/1/ranges/18.json...
[node 1] writing range 18: done
[node 1] writing range 19...
[node 1] writing range 19: converting to JSON...
[node 1] writing range 19: writing binary output: debug/nodes/1/ranges/19.json...
[node 1] writing range 19: done
[node 1] writing range 20...
[node 1] writing range 20: converting to JSON...
[node 1] writing range 20: writing binary output: debug/nodes/1/ranges/20.json...
[node 1] writing range 20: done
[node 1] writing range 21...
[node 1] writing range 21: converting to JSON...
[node 1] writing range 21: writing binary output: debug/nodes/1/ranges/21.json...
[node 1] writing range 21: done
[node 1] writing range 22...
[node 1] writing range 22: converting to JSON...
[node 1] writing range 22: writing binary output: debug/nodes/1/ranges/22.json...
[node 1] writing range 22: done
[node 1] writing range 23...
[node 1] writing range 23: converting to JSON...
[node 1] writing range 23: writing binary output: debug/nodes/1/ranges/23.json...
[node 1] writing range 23: done
[node 1] writing range 24...
[node 1] writing range 24: converting to JSON...
[node 1] writing range 24: writing binary output: debug/nodes/1/ranges/24.json...
[node 1] writing range 24: done
[node 1] writing range 25...
[node 1] writing range 25: converting to JSON...
[node 1] writing range 25: writing binary output: debug/nodes/1/ranges/25.json...
[node 1] writing range 25: done
[node 1] writing range 26...
[node 1] writing range 26: converting to JSON...
[node 1] writing range 26: writing binary output: debug/nodes/1/ranges/26.json...
[node 1] writing range 26: done
[node 1] writing range 27...
[node 1] writing range 27: converting to JSON...
[node 1] writing range 27: writing binary output: debug/nodes/1/ranges/27.json...
[node 1] writing range 27: done
[node 1] writing range 28...
[node 1] writing range 28: converting to JSON...
[node 1] writing range 28: writing binary output: debug/nodes/1/ranges/28.json...
[node 1] writing range 28: done
[node 1] writing range 29...
[node 1] writing range 29: converting to JSON...
[node 1] writing range 29: writing binary output: debug/nodes/1/ranges/29.json...
[node 1] writing range 29: done
[node 1] writing range 30...
[node 1] writing range 30: converting to JSON...
[node 1] writing range 30: writing binary output: debug/nodes/1/ranges/30.json...
[node 1] writing range 30: done
[node 1] writing range 31...
[node 1] writing range 31: converting to JSON...
[node 1] writing range 31: writing binary output: debug/nodes/1/ranges/31.json...
[node 1] writing range 31: done
[node 1] writing range 32...
[node 1] writing range 32: converting to JSON...
[node 1] writing range 32: writing binary output: debug/nodes/1/ranges/32.json...
[node 1] writing range 32: done
[node 1] writing range 33...
[node 1] writing range 33: converting to JSON...
[node 1] writing range 33: writing binary output: debug/nodes/1/ranges/33.json...
[node 1] writing range 33: done
[node 1] writing range 34...
[node 1] writing range 34: converting to JSON...
[node 1] writing range 34: writing binary output: debug/nodes/1/ranges/34.json...
[node 1] writing range 34: done
[node 1] writing range 35...
[node 1] writing range 35: converting to JSON...
[node 1] writing range 35: writing binary output: debug/nodes/1/ranges/35.json...
[node 1] writing range 35: done
[node 1] writing range 36...
[node 1] writing range 36: converting to JSON...
[node 1] writing range 36: writing binary output: debug/nodes/1/ranges/36.json...
[node 1] writing range 36: done
[node 1] writing range 37...
[node 1] writing range 37: converting to JSON...
[node 1] writing range 37: writing binary output: debug/nodes/1/ranges/37.json...
[node 1] writing range 37: done
[node 1] writing range 38...
[node 1] writing range 38: converting to JSON...
[node 1] writing range 38: writing binary output: debug/nodes/1/ranges/38.json...
[node 1] writing range 38: done
[node 1] writing range 39...
[node 1] writing range 39: converting to JSON...
[node 1] writing range 39: writing binary output: debug/nodes/1/ranges/39.json...
[node 1] writing range 39: done
[node 1] writing range 40...
[node 1] writing range 40: converting to JSON...
[node 1] writing range 40: writing binary output: debug/nodes/1/ranges/40.json...
[node 1] writing range 40: done
[node 1] writing range 41...
[node 1] writing range 41: converting to JSON...
[node 1] writing range 41: writing binary output: debug/nodes/1/ranges/41.json...
[node 1] writing range 41: done
[node 1] writing range 42...
[node 1] writing range 42: converting to JSON...
[node 1] writing range 42: writing binary output: debug/nodes/1/ranges/42.json...
[node 1] writing range 42: done
[node 1] writing range 43...
[node 1] writing range 43: converting to JSON...
[node 1] writing range 43: writing binary output: debug/nodes/1/ranges/43.json...
[node 1] writing range 43: done
[node 1] writing range 44...
[node 1] writing range 44: converting to JSON...
[node 1] writing range 44: writing binary output: debug/nodes/1/ranges/44.json...
[node 1] writing range 44: done
[node 1] writing range 45...
[node 1] writing range 45: converting to JSON...
[node 1] writing range 45: writing binary output: debug/nodes/1/ranges/45.json...
[node 1] writing range 45: done
[node 1] writing range 46...
[node 1] writing range 46: converting to JSON...
[node 1] writing range 46: writing binary output: debug/nodes/1/ranges/46.json...
[node 1] writing range 46: done
[node 1] writing range 47...
[node 1] writing range 47: converting to JSON...
[node 1] writing range 47: writing binary output: debug/nodes/1/ranges/47.json...
[node 1] writing range 47: done
[node 1] writing range 48...
[node 1] writing range 48: converting to JSON...
[node 1] writing range 48: writing binary output: debug/nodes/1/ranges/48.json...
[node 1] writing range 48: done
[node 1] writing range 49...
[node 1] writing range 49: converting to JSON...
[node 1] writing range 49: writing binary output: debug/nodes/1/ranges/49.json...
[node 1] writing range 49: done
[node 1] writing range 50...
[node 1] writing range 50: converting to JSON...
[node 1] writing range 50: writing binary output: debug/nodes/1/ranges/50.json...
[node 1] writing range 50: done
[node 1] writing range 51...
[node 1] writing range 51: converting to JSON...
[node 1] writing range 51: writing binary output: debug/nodes/1/ranges/51.json...
[node 1] writing range 51: done
[node 1] writing range 52...
[node 1] writing range 52: converting to JSON...
[node 1] writing range 52: writing binary output: debug/nodes/1/ranges/52.json...
[node 1] writing range 52: done
[node 1] writing range 53...
[node 1] writing range 53: converting to JSON...
[node 1] writing range 53: writing binary output: debug/nodes/1/ranges/53.json...
[node 1] writing range 53: done
[node 1] writing range 54...
[node 1] writing range 54: converting to JSON...
[node 1] writing range 54: writing binary output: debug/nodes/1/ranges/54.json...
[node 1] writing range 54: done
[node 1] writing range 55...
[node 1] writing range 55: converting to JSON...
[node 1] writing range 55: writing binary output: debug/nodes/1/ranges/55.json...
[node 1] writing range 55: done
[node 1] writing range 56...
[node 1] writing range 56: converting to JSON...
[node 1] writing range 56: writing binary output: debug/nodes/1/ranges/56.json...
[node 1] writing range 56: done
[node 1] writing range 57...
[node 1] writing range 57: converting to JSON...
[node 1] writing range 57: writing binary output: debug/nodes/1/ranges/57.json...
[node 1] writing range 57: done
[node 1] writing range 58...
[node 1] writing range 58: converting to JSON...
[node 1] writing range 58: writing binary output: debug/nodes/1/ranges/58.json...
[node 1] writing range 58: done
[node 1] writing range 59...
[node 1] writing range 59: converting to JSON...
[node 1] writing range 59: writing binary output: debug/nodes/1/ranges/59.json...
[node 1] writing range 59: done
[node 1] writing range 60...
[node 1] writing range 60: converting to JSON...
[node 1] writing range 60: writing binary output: debug/nodes/1/ranges/60.json...
[node 1] writing range 60: done
[node 1] writing range 61...
[node 1] writing range 61: converting to JSON...
[node 1] writing range 61: writing binary output: debug/nodes/1/ranges/61.json...
[node 1] writing range 61: done
[node 1] writing range 62...
[node 1] writing range 62: converting to JSON...
[node 1] writing range 62: writing binary output: debug/nodes/1/ranges/62.json...
[node 1] writing range 62: done
[node 1] writing range 63...
[node 1] writing range 63: converting to JSON...
[node 1] writing range 63: writing binary output: debug/nodes/1/ranges/63.json...
[node 1] writing range 63: done
[node 1] writing range 64...
[node 1] writing range 64: converting to JSON...
[node 1] writing range 64: writing binary output: debug/nodes/1/ranges/64.json...
[node 1] writing range 64: done
[node 1] writing range 65...
[node 1] writing range 65: converting to JSON...
[node 1] writing range 65: writing binary output: debug/nodes/1/ranges/65.json...
[node 1] writing range 65: done
[node 1] writing range 66...
[node 1] writing range 66: converting to JSON...
[node 1] writing range 66: writing binary output: debug/nodes/1/ranges/66.json...
[node 1] writing range 66: done
[node 1] writing range 67...
[node 1] writing range 67: converting to JSON...
[node 1] writing range 67: writing binary output: debug/nodes/1/ranges/67.json...
[node 1] writing range 67: done
[node 1] writing range 68...
[node 1] writing range 68: converting to JSON...
[node 1] writing range 68: writing binary output: debug/nodes/1/ranges/68.json...
[node 1] writing range 68: done
[node 1] writing range 69...
[node 1] writing range 69: converting to JSON...
[node 1] writing range 69: writing binary output: debug/nodes/1/ranges/69.json...
[node 1] writing range 69: done
[node 1] writing range 70...
[node 1] writing range 70: converting to JSON...
[node 1] writing range 70: writing binary output: debug/nodes/1/ranges/70.json...
[node 1] writing range 70: done
[node 1] writing range 71...
[node 1] writing range 71: converting to JSON...
[node 1] writing range 71: writing binary output: debug/nodes/1/ranges/71.json...
[node 1] writing range 71: done
[node 1] writing range 72...
[node 1] writing range 72: converting to JSON...
[node 1] writing range 72: writing binary output: debug/nodes/1/ranges/72.json...
[node 1] writing range 72: done
[node 1] writing range 73...
[node 1] writing range 73: converting to JSON...
[node 1] writing range 73: writing binary output: debug/nodes/1/ranges/73.json...
[node 1] writing range 73: done
[node 1] writing range 74...
[node 1] writing range 74: converting to JSON...
[node 1] writing range 74: writing binary output: debug/nodes/1/ranges/74.json...
[node 1] writing range 74: done
[node 1] writing range 75...
[node 1] writing range 75: converting to JSON...
[node 1] writing range 75: writing binary output: debug/nodes/1/ranges/75.json...
[node 1] writing range 75: done
[node 1] writing range 76...
[node 1] writing range 76: converting to JSON...
[node 1] writing range 76: writing binary output: debug/nodes/1/ranges/76.json...
[node 1] writing range 76: done
[node 1] writing range 77...
[node 1] writing range 77: converting to JSON...
[node 1] writing range 77: writing binary output: debug/nodes/1/ranges/77.json...
[node 1] writing range 77: done
[node 1] writing range 78...
[node 1] writing range 78: converting to JSON...
[node 1] writing range 78: writing binary output: debug/nodes/1/ranges/78.json...
[node 1] writing range 78: done
[node 1] writing range 79...
[node 1] writing range 79: converting to JSON...
[node 1] writing range 79: writing binary output: debug/nodes/1/ranges/79.json...
[node 1] writing range 79: done
[node 1] writing range 80...
[node 1] writing range 80: converting to JSON...
[node 1] writing range 80: writing binary output: debug/nodes/1/ranges/80.json...
[node 1] writing range 80: done
[node 1] writing range 81...
[node 1] writing range 81: converting to JSON...
[node 1] writing range 81: writing binary output: debug/nodes/1/ranges/81.json...
[node 1] writing range 81: done
[node 1] writing range 82...
[node 1] writing range 82: converting to JSON...
[node 1] writing range 82: writing binary output: debug/nodes/1/ranges/82.json...
[node 1] writing range 82: done
[node 1] writing range 83...
[node 1] writing range 83: converting to JSON...
[node 1] writing range 83: writing binary output: debug/nodes/1/ranges/83.json...
[node 1] writing range 83: done
[node 1] writing range 84...
[node 1] writing range 84: converting to JSON...
[node 1] writing range 84: writing binary output: debug/nodes/1/ranges/84.json...
[node 1] writing range 84: done
[node 1] writing range 85...
[node 1] writing range 85: converting to JSON...
[node 1] writing range 85: writing binary output: debug/nodes/1/ranges/85.json...
[node 1] writing range 85: done
[node 1] writing range 86...
[node 1] writing range 86: converting to JSON...
[node 1] writing range 86: writing binary output: debug/nodes/1/ranges/86.json...
[node 1] writing range 86: done
[node 1] writing range 87...
[node 1] writing range 87: converting to JSON...
[node 1] writing range 87: writing binary output: debug/nodes/1/ranges/87.json...
[node 1] writing range 87: done
[node 1] writing range 88...
[node 1] writing range 88: converting to JSON...
[node 1] writing range 88: writing binary output: debug/nodes/1/ranges/88.json...
[node 1] writing range 88: done
[node 1] writing range 89...
[node 1] writing range 89: converting to JSON...
[node 1] writing range 89: writing binary output: debug/nodes/1/ranges/89.json...
[node 1] writing range 89: done
[node 1] writing range 90...
[node 1] writing range 90: converting to JSON...
[node 1] writing range 90: writing binary output: debug/nodes/1/ranges/90.json...
[node 1] writing range 90: done
[node 1] writing range 91...
[node 1] writing range 91: converting to JSON...
[node 1] writing range 91: writing binary output: debug/nodes/1/ranges/91.json...
[node 1] writing range 91: done
[node 1] writing range 92...
[node 1] writing range 92: converting to JSON...
[node 1] writing range 92: writing binary output: debug/nodes/1/ranges/92.json...
[node 1] writing range 92: done
[node 1] writing range 93...
[node 1] writing range 93: converting to JSON...
[node 1] writing range 93: writing binary output: debug/nodes/1/ranges/93.json...
[node 1] writing range 93: done
[node 1] writing range 94...
[node 1] writing range 94: converting to JSON...
[node 1] writing range 94: writing binary output: debug/nodes/1/ranges/94.json...
[node 1] writing range 94: done
[node 1] writing range 95...
[node 1] writing range 95: converting to JSON...
[node 1] writing range 95: writing binary output: debug/nodes/1/ranges/95.json...
[node 1] writing range 95: done
[node 1] writing range 96...
[node 1] writing range 96: converting to JSON...
[node 1] writing range 96: writing binary output: debug/nodes/1/ranges/96.json...
[node 1] writing range 96: done
[node 1] writing range 97...
[node 1] writing range 97: converting to JSON...
[node 1] writing range 97: writing binary output: debug/nodes/1/ranges/97.json...
[node 1] writing range 97: done
[node 1] writing range 98...
[node 1] writing range 98: converting to JSON...
[node 1] writing range 98: writing binary output: debug/nodes/1/ranges/98.json...
[node 1] writing range 98: done
[node 1] writing range 99...
[node 1] writing range 99: converting to JSON...
[node 1] writing range 99: writing binary output: debug/nodes/1/ranges/99.json...
[node 1] writing range 99: done
[node 1] writing range 114...
[node 1] writing range 114: converting to JSON...
[node 1] writing range 114: writing binary output: debug/nodes/1/ranges/114.json...
[node 1] writing range 114: done
[node 1] writing range 115...
[node 1] writing range 115: converting to JSON...
[node 1] writing range 115: writing binary output: debug/nodes/1/ranges/115.json...
[node 1] writing range 115: done
[node 1] writing range 124...
[node 1] writing range 124: converting to JSON...
[node 1] writing range 124: writing binary output: debug/nodes/1/ranges/124.json...
[node 1] writing range 124: done
[cluster] pprof summary script...
[cluster] pprof summary script: writing binary output: debug/pprof-summary.sh...
[cluster] pprof summary script: done
[cluster] hot range summary script...
[cluster] hot range summary script: writing binary output: debug/hot-ranges.sh...
[cluster] hot range summary script: done
$

That's attached here: cockroachdb-zip.zip.

Just for fun, I tweaked the CockroachDB command-line so that it would listen on the listen-url that was in our test's log file and started it up. The end of the log file currently includes:

[2022-05-27T19:43:09.052772299Z]  WARN: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: failed to populate built-in PopulateBuiltinUsers; will retry in 2027.307201027s
    error_message: ServiceUnavailable { internal_message: "Timeout accessing connection pool" }
[2022-05-27T19:53:43.439336872Z] TRACE: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: listing sagas
[2022-05-27T19:54:13.440856692Z]  WARN: 4aa6dc47-0947-46ad-9b2f-5d22d24b79cc/ServerContext/10286 on ivanova: failed to list sagas (will retry after 3316.235744587s): Service Unavailable: Timeout accessing connection pool

I hope that if I wait about 50 minutes, the test should complete successfully.

@davepacheco
Copy link
Collaborator Author

Maybe related? golang/go#45643 CC @jclulow

@davepacheco
Copy link
Collaborator Author

Following up on my previous-previous comment: after starting CockroachDB on the right port, that test did eventually uncork and complete successfully! It happened around 20:16:57, earlier than I'd hoped, because we were only blocked on the populate step, not the saga list. That checks out -- that's when the populate step was scheduled to be retried.

@davepacheco davepacheco changed the title test_disk_not_found_before_creation apparently hung after cockroachdb crashed cockroachdb crashed during test run May 27, 2022
@davepacheco davepacheco changed the title cockroachdb crashed during test run cockroachdb crashed during test run, reports EINVAL from port_getn Jun 1, 2022
@davepacheco
Copy link
Collaborator Author

davepacheco commented Jun 1, 2022

Last week, we dug in a little bit and found:

  • the only documented cases where port_getn returns EINVAL are very clear invalid-args cases ("the timeout element tv_sec is < 0 or the timeout element tv_nsec is < 0 or > 1000000000" and "the list[] argument is NULL, the nget argument is NULL, or the content of nget is > max and max is > 0."
  • a somewhat cursory inspection of the kernel code paths invoked from port_getn revealed no other case where port_getn would return EINVAL
  • Here is the port_getn caller in Golang v1.16.10:
    • "events" is always a reference to a stack-declared array, so I don't see how that would ever be NULL
    • "wait" (the timeout structure) is a little more complicated. For that, we need the arguments to netpoll. From the stack that Golang printed when it crashed, the arguments were runtime.netpoll(0x0, 0x0). I'm not exactly sure what that means because this function only accepts one argument, delay. If that was 0, then "wait" should be a pointer to a timespec on the stack that's never explicitly initialized. I gather structures are zero-filled in Go by default, so this
      should have been a valid structure indicating a timeout of 0 seconds and 0 nanoseconds.
    • "max" is always the length of "events", which is always 128
    • "nget" is always a pointer to an int containing 1.

Based on the Go code, I don't see how the first invocation of port_getn by netpoll could result in EINVAL. In certain cases, we hit a goto at L256 that causes us to invoke port_getn again. This happens if port_getn returns a negative value (indicating an error) and the error is either EINTR or ETIME and the delay argument to netpoll was zero or negative. I gather this is saying: if we were interrupted by a signal or timeout [and we've previously checked that if it was a timeout then no events were returned], then try again. In this case, all the arguments are the same as before. (I initially thought the kernel might modify the timeout argument, but the manual page doesn't mention this and from looking at the code paths I don't think it does.) So I don't see how the Go code should be able to trigger EINVAL from port_getn.

There are a few possibilities here:

  • I'm wrong in my read of the Go and kernel code paths and there is a way to get this code to trigger EINVAL
  • there's something sneakier going on triggering this (e.g., something is clobbering an argument unexpectedly with a value that's invalid)
  • the report of EINVAL itself is wrong, as sometimes happens if code is built improperly without -D_REENTRANT or the like

There are a couple of angles here:

  • look at Golang changes since this (since v1.16.10) to see if there have been other fixes -- I thought there was an event ports change more recently than that
  • instrument with DTrace and try to reproduce by running the test in a loop and hoping for the best

I've been running this DTrace script to try to collect more data if this happens again (edit: the previous version of this script had a bug in the execname predicate and I've fixed it below):

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option destructive
#pragma D option bufpolicy=ring

/* Trace returns of syscalls from CockroachDB returning with errno = EINVAL. */
syscall:::return
/execname == "cockroach" && errno == EINVAL/
{
        printf(
            "%Y.%09d pid %6d tid %4d syscall %s returning %d (errno = EINVAL)\n",
            walltimestamp, walltimestamp % 1000000000, pid, tid, probefunc,
            arg1);
}

/* Trace entry and return for every call to "port_getn" from CockroachDB. */
fbt::port_getn:entry
/execname == "cockroach"/
{
        printf("%Y.%09d pid %6d tid %4d port_getn  entry\n",
                walltimestamp, walltimestamp % 1000000000, pid, tid);

}

fbt::port_getn:return
/execname == "cockroach"/
{
        printf("%Y.%09d pid %6d tid %4d port_getn returning %d\n",
                walltimestamp, walltimestamp % 1000000000, pid, tid, arg1);

}

/*
 * Stop CockroachDB and this script if we ever see it write out the string that
 * indicates it's about to exit.
 */
syscall::write:entry
/execname == "cockroach" &&
 strstr(copyinstr(arg1, arg2), "netpoll failed") != NULL/
{
        printf("cockroachdb (pid %d) about to exit", pid);
        stop();
        exit(0);
}

/*
 * Do the same if we ever see port_getn return EINVAL.  This would be a better
 * case to catch because we'd have more stuff on the stack.
 */
fbt::port_getn:return
/execname == "cockroach" && arg1 == EINVAL/
{
        printf("port_getn returned EINVAL (pid %d)", pid);
        stop();
        exit(0);
}

The intent is that if port_getn ever returns EINVAL to cockroachdb, we stop CockroachDB so we can inspect it. We also trace all calls to port_getn from CockroachDB to see what they return and what thread made them. We're using bufpolicy=wrong so that we only write out the last of these before the DTrace script terminates. I'm hopeful that this will confirm or rule out some possibilities here: if we can catch this again, we should either see port_getn returning EINVAL or not; if so, we may have enough state to see exactly what arguments were passed. If not, we'll know something else is going on.

@davepacheco
Copy link
Collaborator Author

davepacheco commented Jun 1, 2022

With the above D script running, I've been trying to reproduce this with:

#!/bin/bash

TMPDIR=$TMPDIR/try_repro.$$
mkdir -p $TMPDIR

cd nexus
for ((i = 0; i >= 0; i++)) {
	echo ATTEMPT $i;
	../target/debug/deps/test_all-d586ea57740e3382 \
	    test_disk_create_disk_that_already_exists_fails || break
}

which I run as:

nohup ./try_repro.sh > try_repro.out 2>&1 &

So far, I've only caught #1144 instead.

@davepacheco
Copy link
Collaborator Author

I successfully reproduced this issue after 1328 iterations. But there was a bug in the predicate of my D script that caused it to not ever fire. I've fixed that and am trying again.

@davepacheco davepacheco changed the title cockroachdb crashed during test run, reports EINVAL from port_getn cockroachdb crashed in Go runtime after reporting EINVAL from port_getn Jun 2, 2022
@davepacheco
Copy link
Collaborator Author

Okay, I managed to reproduce this with the right D enabling:

ATTEMPT 10274

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails has been running for over 60 seconds

The D script exited and emitted:

$ tail -n 30 dtrace-einval.out 
2022 Jun  2 13:22:45.934004332 pid   1347 tid   13 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   15 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   15 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid   15 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   13 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid   13 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   13 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid   13 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   13 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid   13 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   13 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid   13 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   18 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid   18 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   12 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid   12 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   12 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid   12 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   12 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid   12 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   18 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid   18 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   15 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid   15 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid   12 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid   12 port_getn returning 0
2022 Jun  2 13:22:45.934004332 pid   1347 tid    1 syscall lwp_detach returning -1 (errno = EINVAL)
2022 Jun  2 13:22:45.934004332 pid   1347 tid    3 port_getn  entry
2022 Jun  2 13:22:45.934004332 pid   1347 tid    3 port_getn returning 22
port_getn returned EINVAL (pid 1347)

This confirms that the kernel really did return EINVAL from port_getn. The D script stopped the process and I saved arguments (with pargs), the environment (with pargs -e), open files (with pfiles), memory mappings (with pmap and pmap -x), and a core file. The process remains stopped for the moment but I've mostly been working on the core file.

In terms of the core file, here are the registers of the stopped thread:

> ::status
debugging core file of cockroach (64-bit) from ivanova
file: /home/dap/tools/cockroachdb-v21.2.9/bin/cockroach
initial argv: cockroach start-single-node --insecure --http-addr=:0 --store=path=/dangerzone/
threading model: native threads
status: process core file generated with gcore(1)
> ::regs
%rax = 0x0000000000000016       %r8  = 0x0000000000000001
%rbx = 0xfffffc7feb7fef70       %r9  = 0xfffffc7feb7fef98
%rcx = 0x0000000000000080       %r10 = 0x0000000000000080
%rdx = 0xfffffc7feb7fefc0       %r11 = 0x0000000000c92099
%rsi = 0x0000000000000004       %r12 = 0x0000000000000000
%rdi = 0x0000000000000006       %r13 = 0x000000c000084e30
                                %r14 = 0x00000000060836f0
                                %r15 = 0x0000000000000038

%cs = 0x0053    %fs = 0x0000    %gs = 0x0000
%ds = 0x004b    %es = 0x004b    %ss = 0x004b

%rip = 0xfffffc7feeb5741a libc.so.1`_portfs+0xa
%rbp = 0xfffffc7feb7fee70
%rsp = 0xfffffc7feb7fee28

%rflags = 0x00000283
  id=0 vip=0 vif=0 ac=0 vm=0 rf=0 nt=0 iopl=0x0
  status=<of,df,IF,tf,SF,zf,af,pf,CF>

%gsbase = 0x0000000000000000
%fsbase = 0xfffffc7fef040ac0
%trapno = 0x400
   %err = 0xfb8005f8

From the kernel implementation of save_syscall_args(), I infer that the syscall arguments go in %rdi, %rsi, %rdx, %rcx, %r8, and %r9 (in that order). Here are their values (from the output above), along with how port_getn interprets them:

rdi = 0x6 
rsi = 0x4 
rdx = 0xfffffc7feb7fefc0
rcx = 0x80
r8  = 0x0000000000000001
r9  = 0xfffffc7feb7fef98

opcode & PORT_CODE_MASK should be PORT_GETN 
a0 => event port
a1 => &list of events
a2 => max 
a3 => nget
a4 => &timeout struct

opcode = 0x6 should be PORT_GETN when & PORT_CODE_MASK
a0 = 0x4 should be event port fd
a1 = 0xfffffc7feb7fefc0 should be a list of events
a2 = 0x80 should be max 
a3 = 0x0000000000000001 should be nget
a4 = 0xfffffc7feb7fef98 should be pointer to timeout struct

opcode checks out: PORT_CODE_MASK is 0xff.  PORT_GETN = 6 
a0 = 0x4 checks out: supposed to be an event port fd, and it is (and it's the only event port fd for the process, based on the `pfiles` output)
a1 = list of events, well, it at least looks like a pointer
a2 = 0x80 = 128, which is the right "max"
a3 = 1, which is "nget", which is right
a4 = 0xfffffc7feb7fef98
that looks like garbage:
> 0xfffffc7feb7fef98::print struct timespec
{
    tv_sec = c001c97500
    tv_nsec = 0xc000240000
}

So everything looks right except for the timespec, and given the timespec's current value, the EINVAL is expected.

Here's the stack:

> $C
fffffc7feb7fee70 libc.so.1`_portfs+0xa()
fffffc7feb7fef28 runtime.asmsysvicall6+0x5a()
fffffc7feb7ffbc0 runtime.netpoll+0xc5()
fffffc7feb7ffce8 runtime.findrunnable+0xf72()
fffffc7feb7ffd50 runtime.schedule+0x2d7()
fffffc7feb7ffd88 runtime.preemptPark+0xb4()
fffffc7feb7fff38 runtime.newstack+0x2ee()
000000c002702598 runtime.morestack+0xa1()
000000c0027026b0 github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).destroy+0x150()
000000c0027026d8 github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).Close+0x74()
000000c002702898 github.com/cockroachdb/cockroach/pkg/storage.MVCCGet+0x29b()
000000c002702978 github.com/cockroachdb/cockroach/pkg/storage.MVCCGetProto+0xd9()
000000c002702a70 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).tryGetOrCreateReplica.func1+0xfd()
000000c002702c00 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).tryGetOrCreateReplica+0xb3b()
000000c002702dd8 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).getOrCreateReplica+0x1f8()
000000c002702e80 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).acquireSplitLock+0xb6()
000000c002702ec8 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).maybeAcquireSplitMergeLock+0xdf()
000000c0027031f0 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).Stage+0x238()
000000c002703238 github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.Batch.Stage-fm+0x4c()
000000c0027032c0 github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCmdIter+0x142()
000000c002703398 github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch+0x185()
000000c002703420 github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries+0xc5()
000000c002703b90 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked+0x100d()
000000c002703cd0 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady+0x11c()
000000c002703e78 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady+0x145()
000000c002703ef8 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker+0x2c2()
000000c002703f20 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker-fm+0x47()
000000c002703fa0 github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2+0xf3()
0000000000000000 runtime.goexit+1()

We appear to have useful frame pointers, fortunately. Note that the timespec address is on the stack inside the "netpoll" frame, which makes sense. According to the pmap output, its address is also within thread 3's stack, which is consistent with DTrace saying thread 3 was the one we stopped.

From the Go code, we'd expect the stack to include the actual, zero'd timespec; the pointer to it; and an array of 128 port_events. I'd expect:

fffffc7feb7fef98  time_t c001c97500 (8 bytes)
fffffc7feb7fefa0  long   c000240000 (8 bytes)
fffffc7feb7fefa8  ??? (0x18 = 24 bytes)
fffffc7feb7fefc0  port_event_t array, each being 0x18 = 24 bytes

In fact:

> fffffc7feb7fef98,0t4/np
0xfffffc7feb7fef98:
                0xc001c97500
                0xc000240000
                0xfffffc7feb7fef98
                0

It looks like "wait" got put right after "ts" on the stack, which is surprising but fine. That explains what's in between.

It looks from the disassembly like we initialize the struct timespec by xor'ing %xmm0 with itself, then putting the result onto the stack at -0x60(%rsp) -- fine.

It sure looks as though something has just scribbled on this thread's stack. It doesn't look like there are that many instructions between where we initialize the timespec and where we call into port_getn, and they don't look particularly suspicious. Could it be another thread?

Well, what are those values (0xc001c97500 and 0xc000240000)? They look like addresses, and they are mapped:

> 0xc001c97500$m
            BASE            LIMIT             SIZE NAME
      c000000000       c004000000          4000000 [ unknown ]
> 0xc000240000$m
            BASE            LIMIT             SIZE NAME
      c000000000       c004000000          4000000 [ unknown ]

pmap says it's an anonymous region:

dap@ivanova port-getn-einval $ grep -i c000000000 1347.pmap
000000C000000000      65536K rw---    [ anon ]
dap@ivanova port-getn-einval $ grep -i c000000000 1347.pmap-x 
000000C000000000      65536      65536      65536          - rw---    [ anon ]

That region is in a weirdly different spot than the rest:

dap@ivanova port-getn-einval $ cat 1347.pmap
1347:	cockroach start-single-node --insecure --http-addr=:0 --store=path=/da
0000000000400000     144980K r-x--  /home/dap/tools/cockroachdb-v21.2.9/bin/cockroach
00000000091A4000       3740K rw---  /home/dap/tools/cockroachdb-v21.2.9/bin/cockroach
000000000954B000        488K rw---  /home/dap/tools/cockroachdb-v21.2.9/bin/cockroach
000000C000000000      65536K rw---    [ anon ]
FFFFFC7FB6600000       1596K rw---    [ anon ]
FFFFFC7FB678F000        452K -----    [ anon ]
FFFFFC7FB6DFF000          4K rw--R    [ stack tid=23 ]
FFFFFC7FB7800000        480K rw---    [ anon ]
FFFFFC7FB7878000       1568K -----    [ anon ]
FFFFFC7FB7C00000      16384K rw---    [ anon ]
FFFFFC7FB8FFE000          8K rw--R    [ stack tid=22 ]
FFFFFC7FB9200000         64K rw---    [ anon ]
FFFFFC7FB9210000       1984K -----    [ anon ]
FFFFFC7FB97FE000          8K rw--R    [ stack tid=14 ]
FFFFFC7FB9DFE000          8K rw--R    [ stack tid=20 ]
FFFFFC7FBA200000        716K rw---    [ anon ]
FFFFFC7FBA2B3000       1332K -----    [ anon ]
FFFFFC7FBA42B000        256K rw---    [ anon ]
FFFFFC7FBA4A4000        256K rw---    [ anon ]
FFFFFC7FBA4E9000        256K rw---    [ anon ]
FFFFFC7FBA535000        256K rw---    [ anon ]
FFFFFC7FBA584000        256K rw---    [ anon ]
FFFFFC7FBA600000         64K rw---    [ anon ]
FFFFFC7FBA610000       1984K -----    [ anon ]
FFFFFC7FBABFE000          8K rw--R    [ stack tid=21 ]
FFFFFC7FBADF1000        256K rw---    [ anon ]
FFFFFC7FBB000000        220K rw---    [ anon ]
FFFFFC7FBB037000       1828K -----    [ anon ]
FFFFFC7FBB600000         68K rw---    [ anon ]
FFFFFC7FBB611000       1980K -----    [ anon ]
FFFFFC7FBBDFE000          8K rw--R    [ stack tid=19 ]
FFFFFC7FBBEC1000        256K rw---    [ anon ]
FFFFFC7FBC200000        248K rw---    [ anon ]
FFFFFC7FBC23E000       1800K -----    [ anon ]
FFFFFC7FBC600000        176K rw---    [ anon ]
FFFFFC7FBC62C000       1872K -----    [ anon ]
FFFFFC7FBCACD000        256K rw---    [ anon ]
FFFFFC7FBCE00000        232K rw---    [ anon ]
FFFFFC7FBCE3A000       1816K -----    [ anon ]
FFFFFC7FBD3FE000          8K rw--R    [ stack tid=18 ]
FFFFFC7FBD7FE000          8K rw--R    [ stack tid=16 ]
FFFFFC7FBDBFE000          8K rw--R    [ stack tid=17 ]
FFFFFC7FBDFFE000          8K rw--R    [ stack tid=15 ]
FFFFFC7FBE400000        112K rw---    [ anon ]
FFFFFC7FBE41C000       1936K -----    [ anon ]
FFFFFC7FCE980000          4K rw---    [ anon ]
FFFFFC7FDEA00000         68K rw---    [ anon ]
FFFFFC7FDEA11000       1980K -----    [ anon ]
FFFFFC7FDEC8E000        256K rw---    [ anon ]
FFFFFC7FDEFB4000        256K rw---    [ anon ]
FFFFFC7FDF15D000        256K rw---    [ anon ]
FFFFFC7FDF400000       1260K rw---    [ anon ]
FFFFFC7FDF53B000        788K -----    [ anon ]
FFFFFC7FDF9FE000          8K rw--R    [ stack tid=11 ]
FFFFFC7FDFE00000         68K rw---    [ anon ]
FFFFFC7FDFE11000       1980K -----    [ anon ]
FFFFFC7FE03FE000          8K rw--R    [ stack tid=13 ]
FFFFFC7FE0800000        788K rw---    [ anon ]
FFFFFC7FE08C5000       1260K -----    [ anon ]
FFFFFC7FE0FFE000          8K rw--R    [ stack tid=12 ]
FFFFFC7FE15FF000          4K rw--R    [ stack tid=6 ]
FFFFFC7FE1A00000         68K rw---    [ anon ]
FFFFFC7FE1A11000       1980K -----    [ anon ]
FFFFFC7FE1FFE000          8K rw--R    [ stack tid=10 ]
FFFFFC7FE2200000         64K rw---    [ anon ]
FFFFFC7FE2210000       1984K -----    [ anon ]
FFFFFC7FE2600000         64K rw---    [ anon ]
FFFFFC7FE2610000       1984K -----    [ anon ]
FFFFFC7FE2BFE000          8K rw--R    [ stack tid=7 ]
FFFFFC7FE3000000         64K rw---    [ anon ]
FFFFFC7FE3010000       1984K -----    [ anon ]
FFFFFC7FE35FE000          8K rw--R    [ stack tid=8 ]
FFFFFC7FE39FF000          4K rw--R    [ stack tid=9 ]
FFFFFC7FE3C00000         68K rw---    [ anon ]
FFFFFC7FE3C11000       1980K -----    [ anon ]
FFFFFC7FE4000000         64K rw---    [ anon ]
FFFFFC7FE4010000       1984K -----    [ anon ]
FFFFFC7FE4600000      32768K rw---    [ anon ]
FFFFFC7FE6BFE000          8K rw--R    [ stack tid=5 ]
FFFFFC7FE8E30000          4K rw---    [ anon ]
FFFFFC7FEB000000         68K rw---    [ anon ]
FFFFFC7FEB011000       1980K -----    [ anon ]
FFFFFC7FEB7FE000          8K rw--R    [ stack tid=3 ]
FFFFFC7FEBBFE000          8K rw--R    [ stack tid=4 ]
FFFFFC7FEBE00000       2040K rw---    [ anon ]
FFFFFC7FEBFFE000          8K -----    [ anon ]
FFFFFC7FEC24C000        336K r-x--  /lib/amd64/ld.so.1
FFFFFC7FEC2B0000         12K rwx--  /lib/amd64/ld.so.1
FFFFFC7FEC2B3000          8K rwx--  /lib/amd64/ld.so.1
FFFFFC7FEC7FF000          4K rw--R    [ stack tid=2 ]
FFFFFC7FECE06000          4K rw---    [ anon ]
FFFFFC7FED400000       2116K rw---    [ anon ]
FFFFFC7FEDC00000       2048K rw---    [ anon ]
FFFFFC7FEDEB7000        256K rw---    [ anon ]
FFFFFC7FEDFA9000        256K rw---    [ anon ]
FFFFFC7FEDFF0000        256K rw---    [ anon ]
FFFFFC7FEE031000        256K rw---    [ anon ]
FFFFFC7FEE07E000        256K rw---    [ anon ]
FFFFFC7FEE0F2000        256K rw---    [ anon ]
FFFFFC7FEE140000        256K rw---    [ anon ]
FFFFFC7FEE1B7000        256K rw---    [ anon ]
FFFFFC7FEE200000       1740K rw---    [ anon ]
FFFFFC7FEE3B3000        308K -----    [ anon ]
FFFFFC7FEE425000        256K rw---    [ anon ]
FFFFFC7FEE46A000        256K rw---    [ anon ]
FFFFFC7FEE4E0000          4K rwx--    [ anon ]
FFFFFC7FEE598000          4K rw---    [ anon ]
FFFFFC7FEE64A000       1408K rw---    [ anon ]
FFFFFC7FEE7B1000        256K rw---    [ anon ]
FFFFFC7FEE7F3000       1024K rw---    [ anon ]
FFFFFC7FEE910000         64K rw---    [ anon ]
FFFFFC7FEE940000         64K rwx--    [ anon ]
FFFFFC7FEE95C000        128K rw---    [ anon ]
FFFFFC7FEE990000         64K rw---    [ anon ]
FFFFFC7FEE9B0000         64K rw---    [ anon ]
FFFFFC7FEE9C6000          4K rwxs-    [ anon ]
FFFFFC7FEE9D0000         64K rw---    [ anon ]
FFFFFC7FEE9F0000          4K rw---    [ anon ]
FFFFFC7FEEA10000         64K rw---    [ anon ]
FFFFFC7FEEA30000       1556K r-x--  /lib/amd64/libc.so.1
FFFFFC7FEEBC5000         48K rw---  /lib/amd64/libc.so.1
FFFFFC7FEEBD1000         16K rw---  /lib/amd64/libc.so.1
FFFFFC7FEEC00000       2152K r-x--  /home/dap/tools/cockroachdb-v21.2.9/lib/libstdc++.so.6
FFFFFC7FEEE29000         56K rw---  /home/dap/tools/cockroachdb-v21.2.9/lib/libstdc++.so.6
FFFFFC7FEEE37000         12K rw---  /home/dap/tools/cockroachdb-v21.2.9/lib/libstdc++.so.6
FFFFFC7FEEE3C000        256K rw---    [ anon ]
FFFFFC7FEEE80000        108K r-x--  /home/dap/tools/cockroachdb-v21.2.9/lib/libgcc_s.so.1
FFFFFC7FEEEAA000          4K rw---  /home/dap/tools/cockroachdb-v21.2.9/lib/libgcc_s.so.1
FFFFFC7FEEEC0000          4K rwx--    [ anon ]
FFFFFC7FEEEE0000          8K r-x--  /lib/amd64/librt.so.1
FFFFFC7FEEF10000          4K rwx--    [ anon ]
FFFFFC7FEEF20000          8K r-x--  /lib/amd64/libsendfile.so.1
FFFFFC7FEEF32000          4K rw---  /lib/amd64/libsendfile.so.1
FFFFFC7FEEF36000        256K rw---    [ anon ]
FFFFFC7FEEF80000        444K r-x--  /home/dap/tools/cockroachdb-v21.2.9/lib/libncurses.so.6
FFFFFC7FEEFFE000         20K rw---  /home/dap/tools/cockroachdb-v21.2.9/lib/libncurses.so.6
FFFFFC7FEF020000         64K rwx--    [ anon ]
FFFFFC7FEF040000         64K rwx--    [ anon ]
FFFFFC7FEF060000         64K rwx--    [ anon ]
FFFFFC7FEF080000        376K r-x--  /lib/amd64/libm.so.2
FFFFFC7FEF0EE000         20K rw---  /lib/amd64/libm.so.2
FFFFFC7FEF119000        128K rw---    [ anon ]
FFFFFC7FEF160000          4K rwx--    [ anon ]
FFFFFC7FEF180000          4K r-x--  /lib/amd64/libdl.so.1
FFFFFC7FEF190000          8K r-x--  /lib/amd64/libxnet.so.1
FFFFFC7FEF1B0000          4K rwx--    [ anon ]
FFFFFC7FEF1C0000          4K rwx--    [ anon ]
FFFFFC7FEF1C7000        256K rw---    [ anon ]
FFFFFC7FEF210000          4K rw---    [ anon ]
FFFFFC7FEF230000        560K r-x--  /lib/amd64/libnsl.so.1
FFFFFC7FEF2CC000         12K rw---  /lib/amd64/libnsl.so.1
FFFFFC7FEF2CF000         32K rw---  /lib/amd64/libnsl.so.1
FFFFFC7FEF2F0000          4K rwx--    [ anon ]
FFFFFC7FEF310000         24K rwx--    [ anon ]
FFFFFC7FEF320000          4K rwx--    [ anon ]
FFFFFC7FEF330000          4K rwx--    [ anon ]
FFFFFC7FEF340000          4K rwx--    [ anon ]
FFFFFC7FEF360000          4K rwx--    [ anon ]
FFFFFC7FEF370000          4K rwx--    [ anon ]
FFFFFC7FEF390000          4K rwx--    [ anon ]
FFFFFC7FEF3A0000         72K r-x--  /lib/amd64/libsocket.so.1
FFFFFC7FEF3C2000          4K rw---  /lib/amd64/libsocket.so.1
FFFFFC7FEF3D0000          4K rwx--    [ anon ]
FFFFFC7FEF3F0000          4K r--s- 
FFFFFC7FFFDF9000         28K rw---    [ stack ]
         total       333056K

which makes me wonder if it was mapped with MAP_FIXED. A quick search in the Golang source suggests that this region is used by the Go memory allocator.

So what's at those addresses? That might give a clue about who clobbered the stack.

> 0xc001c97500,0t20/np
0xc001c97500:   
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
                0                  
> 0xc000240000,0t20/np
0xc000240000:   
                0xfffffc7fba539b90 
                0xc001224040       
                go.itab.*github.com/cockroachdb/pebble/vfs.enospcFS,github.com/cockroachdb/pebble/vfs.FS
                0xc0010f6300       
                0xc0010e8000       
                0x41               
                2                  
                0x10               
                0                  
                0                  
                0                  
                0x4fdf7            
                0x30b              
                0                  
                0                  
                0                  
                0                  
                0                  
                0x50107            
                0x2de              

That's not super conclusive but it looks worryingly like CockroachDB itself or Golang did that, maybe part of Pebble (part of CockroachDB)?

One idea we had is to trace fbt::port_getn:entry for CockroachDB, record the current timespec contents, use the chill action for as long as we can, then record the contents again. If there's an unconstrained window here where some other thread is clobbering our stack, the chill should make it much more likely that we hit this problem, and if we detect a difference in the timespec before and after the chill, we can be pretty confident it wasn't the current thread that did it. Then maybe we can modify this to, say, stop the process on entry to port_getn, fork off mdb to create a watchpoint, then run all other threads and wait for the watchpoint to fire. On the other hand, if this doesn't work (e.g., if the chill doesn't make this more likely or we never see the timespec change), that doesn't tell us much. It's possible the window is smaller than we realize and introduced by something the current thread is doing, and that by the time we enter port_getn, it's too late. Anyway, that seems relatively easy to try.

@davepacheco
Copy link
Collaborator Author

I filed cockroachdb/cockroach#82958 for this issue to get input from the CockroachDB team.

@davepacheco
Copy link
Collaborator Author

Following up on my previous comment, I wrote this D script to try to blow open the window for possible corruption:

#!/usr/sbin/dtrace -Cs

/*
 * D script that attempts to blow open a window where something is corrupting
 * arguments to `port_getn`.
 * See https://github.com/oxidecomputer/omicron/issues/1130.
 */

#pragma D option quiet
#pragma D option bufpolicy=ring
#pragma D option destructive

/* from sys/port_impl.h */
#define PORT_GETN 6
#define PORT_CODE_MASK 0xff

#define CHILL_TIME_NS 10000000

fbt::portfs:entry
/execname == "cockroach" && (arg0 & PORT_CODE_MASK) == PORT_GETN && arg5 != 0/
{
        self->initial_timespec = (struct timespec *)
            copyin(arg5, sizeof (struct timespec));
        printf(
            "%Y.%09d %d portfs entry (PORT_GETN): pid %6d tid %3d: initial: ",
            walltimestamp,
            walltimestamp % 1000000000,
            timestamp,
            pid,
            tid
        );
        print(*self->initial_timespec);
        printf("\n");

        chill(CHILL_TIME_NS);

        self->later_timespec = (struct timespec *)
            copyin(arg5, sizeof (struct timespec));
        printf(
            "%Y.%09d %d portfs entry (PORT_GETN): pid %6d tid %3d: later: ",
            walltimestamp,
            walltimestamp % 1000000000,
            timestamp,
            pid,
            tid
        );
        print(*self->later_timespec);
        printf("\n");
}

fbt::portfs:entry
/self->later_timespec &&
 self->initial_timespec &&
 /* XXX-dap note: comparing the structs seems to produce false positives */
 (self->later_timespec->tv_sec != self->initial_timespec->tv_sec ||
  self->later_timespec->tv_nsec != self->initial_timespec->tv_nsec)/
{
        printf(
            "%Y.%09d %d portfs entry (PORT_GETN): pid %6d tid %3d: boom!",
            walltimestamp,
            walltimestamp % 1000000000,
            timestamp,
            pid,
            tid
        );
        printf("\ninitial: ");
        print(*self->initial_timespec);
        printf("\nlater: ");
        print(*self->later_timespec);
        printf("\n");
        exit(0);
}


fbt::portfs:entry
{
        self->initial_timespec = 0;
        self->later_timespec = 0;
}

This uses the chill() action to spin for 10ms on entry to port_getn, and it compares the contents of the timespec before and after the chill. On the one hand, this feels like a long shot? On the other hand, if something's occasionally corrupting the stack some time before we enter port_getn, and if it could also happen after entering port_getn, then it seems like this should catch it. Of course, cockroach runs quite a lot slower with this running.

@davepacheco
Copy link
Collaborator Author

The discussion on the cockroach issue got me thinking to try LD_PRELOADing libumem, which can identify various kinds of corruption when it's happening and has other facilities for debugging corruption. It didn't work to just LD_PRELOAD it with the cockroach binary we're using -- it appears to statically link to jemalloc instead of using malloc(3c). With this patch to garbage-compactor, I was able to build a version that uses malloc(3c):

dap@ivanova garbage-compactor $ git log -1
commit 7a9c62363b7bb4d5d928a4f9e717a06fe3e8bc5c (HEAD -> master, origin/master, origin/HEAD)
Author: Joshua M. Clulow <[email protected]>
Date:   Fri Jun 17 15:22:44 2022 -0700

    cockroach 21.2.9
dap@ivanova garbage-compactor $ git diff --cached
diff --git a/cockroach/patches/11-makefile-tags.patch b/cockroach/patches/11-makefile-tags.patch
new file mode 100644
index 0000000..f738a2b
--- /dev/null
+++ b/cockroach/patches/11-makefile-tags.patch
@@ -0,0 +1,11 @@
+--- a/src/github.com/cockroachdb/cockroach/Makefile
++++ b/src/github.com/cockroachdb/cockroach/Makefile
+@@ -163,7 +163,7 @@ DESTDIR :=
+ 
+ DUPLFLAGS    := -t 100
+ GOFLAGS      :=
+-TAGS         :=
++TAGS         := stdmalloc
+ ARCHIVE      := cockroach.src.tgz
+ STARTFLAGS   := -s type=mem,size=1GiB --logtostderr
+ BUILDTARGET  := ./pkg/cmd/cockroach

and that appears to work.

I tested this by using this DTrace invocation to stop a cockroach process run from my shell (pid 22949) after it'd been running a bit:

pfexec dtrace -q -w -n 'fbt::port_getn:return/progenyof(22949) && execname == "cockroach"/{  stop(); printf("%d\n", pid); exit(0); }'

Then I ran LD_PRELOAD=libumem.so UMEM_DEBUG=default cockroach version. The DTrace script stopped it, printed the pid of the cockroach process, then exited. Then I attached with mdb and ran ::findleaks.

For the original binary, I got:

mdb: findleaks: No allocations have occured -- no possible leaks.

For my new binary, I got:

BYTES             LEAKED         VMEM_SEG CALLER
4096                   4 fffffc7fbb580000 MMAP
4096                   1 fffffc7feef8f000 MMAP
4096                   1 fffffc7feae06000 MMAP
4096                   1 fffffc7fd7430000 MMAP
------------------------------------------------------------------------
           Total       4 oversized leaks, 16384 bytes

which at least means something was allocated with libumem!

@davepacheco
Copy link
Collaborator Author

I put a bunch of the relevant files on catacomb:

/data/staff/core/issues/omicron-1130 $ ls -l
total 249892
-rw-rw-r--+  1 dap      staff      21197 Nov  8 21:05 cockroach-stderr-log.gz
-rw-rw-r--+  1 dap      staff     503297 Nov  8 21:05 cockroachdb-directory.tgz
-rw-rw-r--+  1 dap      staff     743349 Nov  8 21:06 cockroachdb-zip.zip
-rw-rw-r--+  1 dap      staff    126624757 Nov  8 21:06 core.10286.gz

@davepacheco
Copy link
Collaborator Author

I also put some D scripts and other data from the June 1-2 investigation there:

dap@catacomb /data/staff/core/issues/omicron-1130 $ ls -l port-getn-einval/
total 176872
-rw-rw-r--+  1 dap      staff        412 Jun  2 21:19 1347.args
-rw-rw-r--+  1 dap      staff       1244 Jun  2 21:19 1347.env
-rw-rw-r--+  1 dap      staff       7017 Jun  2 21:19 1347.files
-rw-rw-r--+  1 dap      staff       8653 Jun  2 23:04 1347.pmap
-rw-rw-r--+  1 dap      staff      13507 Jun  2 23:04 1347.pmap-x
-rw-rw-r--+  1 dap      staff    360417613 Jun  2 21:19 core.1347
-rw-rw-r--+  1 dap      staff    106609443 Jun  2 21:18 dtrace-einval.out
-rwxrwxr-x+  1 dap      staff       1426 Jun  1 21:29 trace-cockroachdb-einval.d
dap@catacomb /data/staff/core/issues/omicron-1130 $ 

@davepacheco
Copy link
Collaborator Author

Found a few more. I regret that this isn't better organized, but I preserved the mtimes here for correlation with the comments here:

dap@catacomb /data/staff/core/issues/omicron-1130 $ ls -l more
total 166441
-rw-rw-r--+  1 dap      staff       2686 Jun 24 23:27 cockroach-issue-notes.txt
-rw-rw-r--+  1 dap      staff    365604725 Jun 16 16:53 core.4029
drwxrwxr-x+  2 dap      staff          3 May 28 22:44 new_bugs
-rwxrwxr-x+  1 dap      staff       1809 Jun 15 21:36 port-getn-chill.d
-rw-rw-r--+  1 dap      staff     948081 May 29 05:31 try_repro.out
-rwxrwxr-x+  1 dap      staff        794 Jun 23 15:46 try_repro.sh
-rw-rw-r--+  1 dap      staff     221301 Jun 22 00:39 try_repro10.out
-rw-rw-r--+  1 dap      staff    3369018 Jun 23 15:47 try_repro11.out
-rw-rw-r--+  1 dap      staff    2585000 Jun 24 10:29 try_repro12.out
-rw-rw-r--+  1 dap      staff    1591360 Jun  1 06:08 try_repro2.out
-rw-rw-r--+  1 dap      staff     280571 Jun  1 21:00 try_repro3.out
-rw-rw-r--+  1 dap      staff     957517 Jun  2 04:28 try_repro4.out
-rw-rw-r--+  1 dap      staff    2177399 Jun  2 20:23 try_repro5.out
-rw-rw-r--+  1 dap      staff      43538 Jun 15 22:22 try_repro6.out
-rw-rw-r--+  1 dap      staff      52197 Jun 17 17:21 try_repro7.out
-rw-rw-r--+  1 dap      staff    1609301 Jun 19 09:26 try_repro8.out
-rw-rw-r--+  1 dap      staff     148212 Jun 21 18:48 try_repro9.out

@davepacheco
Copy link
Collaborator Author

See #1223 (comment). With a workaround for https://www.illumos.org/issues/15254 in place, the test suite has been running for almost three days without issue. Also, this was seen on Go 1.16; we're now on Go 1.19. I can't quite see how that illumos bug could cause this issue but I'm not sure it's worth spending more active time on this?

Okay, I did take a swing at writing a Rust program to try to identify this problem from a dump. On a dump from this core file, it produces:

    $ cargo run --release < ../port-getn-einval/ahlgrep.out 2>/dev/null
    found run of 12 at addr c001172f70:
    found run of 12 at addr c001173110:
    found run of 14 at addr c001ef5c30:
    found run of 52 at addr c00200a4b0:
    found run of 28 at addr c00200ba50:

and I've confirmed that these do indeed look like runs consistent with that problem. They're all in the Go heap. I'm still not sure how we get to this root cause. The closest thing of interest is that a bunch of the stack frames are around c002702598, but that's a few megabytes away.

@davepacheco
Copy link
Collaborator Author

Since the memory that's supposed to be zero'd in this case was written using %xmm0 after that register was zero'd, it is conceivable that this problem was caused by illumos#15367. However, we'd have to get quite unlucky: it'd have to be the case that the rest of the xmm/ymm/zmm registers were all zeroes so that clearing xmm0 put the FPU into the default state; then we'd have to take a signal on that very next instruction.

@davepacheco
Copy link
Collaborator Author

Closing this for the reasons mentioned above. We can reopen if we start seeing it again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Projects
None yet
Development

No branches or pull requests

1 participant