make check hangs on Jetson Nano #3810

Closed
javawolfpack opened this issue Aug 2, 2021 · 7 comments

@javawolfpack

As mentioned in #3808, make check hangs on a Jetson Nano 4GB board (I gave it over 12 hours to complete).

...
PASS: python/t0006-request.py 1 __main__.TestRequestMethods.test_no_topic_invalid
PASS: python/t0006-request.py 2 __main__.TestRequestMethods.test_null_payload

^Cmake[3]: *** [Makefile:3985: python/t0007-watchers.log] Interrupt
make[2]: *** [Makefile:3953: check-TESTS] Interrupt
make[1]: *** [Makefile:4034: check-am] Interrupt
make: *** [Makefile:586: check-recursive] Interrupt

I thought the culprit was python/t0007-watchers, since that is where it stopped, but that test runs fine manually:

$ ../src/cmd/flux python python/t0007-watchers.py
TAP version 13
ok 1 __main__.TestFdWatcher.test_fd_watcher
ok 2 __main__.TestFdWatcher.test_fd_watcher_exception
ok 3 __main__.TestSignal.test_s0_signal_watcher_add
ok 4 __main__.TestSignal.test_s1_signal_watcher_remove
ok 5 __main__.TestSignal.test_signal_watcher
ok 6 __main__.TestSignal.test_signal_watcher_exception
ok 7 __main__.TestSignal.test_signal_watcher_invalid
ok 8 __main__.TestTimer.test_msg_watcher_bytes
ok 9 __main__.TestTimer.test_msg_watcher_unicode
ok 10 __main__.TestTimer.test_s1_0_timer_add
ok 11 __main__.TestTimer.test_s1_1_timer_remove
ok 12 __main__.TestTimer.test_timer_add_negative
ok 13 __main__.TestTimer.test_timer_callback_exception
ok 14 __main__.TestTimer.test_timer_with_reactor
1..14

I've tried a fresh copy of the Ubuntu 20.04.2 image and a fresh install of flux-security, flux-core, and their dependencies, since I thought make check completed the first time I tried, but it still hangs.

@grondo
Contributor

grondo commented Aug 2, 2021

Were you running make check or make -j N check? Either way, I guess there could be some race condition, so the hang only occurs sometimes or rarely. Even in the extremely resource-constrained CI environment, make check completes in under 30 minutes, so if the testsuite isn't making progress for an hour you can assume it is hung.

Sometimes running a test under taskset -c 0 can reproduce racy test hangs more readily. Perhaps you can try:

taskset -c 0 ../src/cmd/flux python python/t0007-watchers.py

If that doesn't work, then run make check until you notice the testsuite no longer makes progress and run pstree or similar to see which test is hung. We can maybe think of some further debugging steps after that.
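
One way to automate the advice above is to rerun the suspect test under a time limit until a run fails to finish. The following is a minimal sketch, not part of flux-core: the function name `run_until_hang`, its defaults, and the 120-second limit are assumptions chosen for illustration, and the optional CPU pinning uses `os.sched_setaffinity` (Linux only) as a stand-in for taskset -c 0.

```python
import os
import subprocess
import sys


def run_until_hang(cmd, attempts=20, timeout=120.0, cpu=None):
    """Run cmd up to `attempts` times.

    Returns ("hung", n) if run n exceeded `timeout` seconds,
    ("failed", n) if run n exited non-zero, or ("ok", attempts)
    if every run finished cleanly. All names and defaults here are
    illustrative assumptions.
    """

    def pin():
        # Runs in the child before exec: restrict it to a single CPU,
        # roughly what `taskset -c <cpu>` does (Linux only).
        os.sched_setaffinity(0, {cpu})

    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                cmd,
                timeout=timeout,
                preexec_fn=pin if cpu is not None else None,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child before raising.
            print(f"attempt {attempt}: no exit after {timeout}s, likely hung",
                  file=sys.stderr)
            return ("hung", attempt)
        if result.returncode != 0:
            print(f"attempt {attempt}: exit status {result.returncode}",
                  file=sys.stderr)
            return ("failed", attempt)
    return ("ok", attempts)
```

For example, `run_until_hang(["../src/cmd/flux", "python", "python/t0007-watchers.py"], cpu=0)` would mirror the taskset command above. On a timeout only the direct child is killed, so any processes it started may need to be cleaned up by hand.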

Thanks for your help running this down!

@javawolfpack
Author

Just make check. I have the check running now and will try pstree when it hangs again, assuming it does this time. I'm dumping the output to a file because there was one other error, which I think I've resolved, that I can't see in the terminal history by the point it hangs.

If pstree shows the python test, I will try the taskset approach and let you know what happens.

@grondo
Contributor

grondo commented Aug 2, 2021

Thanks @javawolfpack!

@javawolfpack
Author

It hung, so I installed pstree. This is the output from the deepest make check in my ps list. It seems make check ran make check, which ran this make check-TESTS.

$ pstree 1992215
make───bash───make───bash───tap-driver.sh─┬─awk
                                          └─tap-driver.sh───python3───flux-start─┬─flux-broker-0─┬─python3
                                                                                 │               └─16*[{flux-broker-0}]
                                                                                 └─flux-broker-1───9*[{flux-broker-1}]

I then ran python/t0007-watchers.py numerous times with taskset, and every run completed.

In positive news, aside from the hang, all the other tests seem to be passing now. Since this python test and the one after it are the only ones remaining, and both pass when run manually, for the moment I'm going to build and test flux-sched so I can hopefully figure out how to configure it for our students to use on our new cluster this Fall. If you have any ideas on other things you'd like me to try to troubleshoot this, let me know.

@grondo
Contributor

grondo commented Aug 2, 2021

Thanks, it definitely seems like one of the python tests is the cause of this hang.
We may need to run pstree -a just to be sure it is the test we're assuming.

Once we have that information, I'll try to create a custom version of the test that will bail out on any hang to get us a better idea of where the hang is occurring. That may be easier than trying to diagnose by inspection.

Since the rest of the tests pass, I have a feeling this is a test bug, not a bug in the rest of the system, so you can probably set up the rest of Flux and install with confidence (I hope).
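
As an illustration of the "bail out on any hang" idea, Python's standard `faulthandler` module can dump every thread's traceback and exit once a deadline passes. This is only a sketch under assumptions: the names `arm_watchdog`, `WatchdogTestCase`, and `HANG_TIMEOUT` are hypothetical, the 60-second limit is arbitrary, and it assumes the tests are ordinary `unittest.TestCase` classes, as the test names in the output above suggest. It is not the custom test referred to in this comment.

```python
import faulthandler
import sys
import unittest

HANG_TIMEOUT = 60  # seconds; an assumed limit, tune for slow boards


def arm_watchdog(seconds):
    """After `seconds`, dump all thread tracebacks to stderr and exit 1.

    The dump shows where each thread was stuck at the moment of the hang.
    """
    faulthandler.dump_traceback_later(seconds, exit=True, file=sys.stderr)


class WatchdogTestCase(unittest.TestCase):
    """Hypothetical base class that arms a fresh watchdog for every test."""

    def setUp(self):
        arm_watchdog(HANG_TIMEOUT)

    def tearDown(self):
        # The test finished in time, so disarm before the next one starts.
        faulthandler.cancel_dump_traceback_later()
```

A test class would inherit from `WatchdogTestCase` instead of `unittest.TestCase`; a subclass that defines its own `setUp` or `tearDown` needs to call the parent versions for the watchdog to stay in effect.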

@javawolfpack
Author

javawolfpack commented Aug 2, 2021

I'll run make check again in a moment and do a pstree -a once it hangs again. I'm currently running make check on flux-sched.
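
Where pstree is not installed, the same kind of information (each process in the tree together with its command-line arguments) can be read from /proc on Linux. The sketch below is an assumption-laden illustration, not a Flux tool: `read_proc` and `process_tree` are hypothetical names, and unlike pstree it lists processes only, not threads.

```python
import os


def read_proc(pid):
    """Return (ppid, args) for pid, or None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/stat", errors="replace") as f:
            stat = f.read()
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            raw = f.read()
    except (FileNotFoundError, ProcessLookupError, PermissionError):
        return None  # process exited or is not readable
    # The command name sits in parentheses and may itself contain spaces
    # or ")", so split after the last ")". Next fields: state, ppid, ...
    fields = stat[stat.rindex(")") + 2:].split()
    args = raw.replace(b"\0", b" ").decode(errors="replace").strip()
    return int(fields[1]), args


def process_tree(root):
    """Return [(depth, pid, args)] for root and its descendants, depth-first."""
    info = {}
    children = {}
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue
        entry = read_proc(int(name))
        if entry is not None:
            info[int(name)] = entry
            children.setdefault(entry[0], []).append(int(name))
    rows = []
    stack = [(0, root)]
    while stack:
        depth, pid = stack.pop()
        if pid not in info:
            continue
        rows.append((depth, pid, info[pid][1]))
        # Push in reverse so lower pids are visited first.
        for child in sorted(children.get(pid, []), reverse=True):
            stack.append((depth + 1, child))
    return rows
```

For example, `for depth, pid, args in process_tree(1992215): print("  " * depth + f"{pid} {args}")` would print an indented listing for the make process shown earlier, with the arguments that identify which test script the python3 process is running.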

@garlick
Member

garlick commented Jul 25, 2022

Old bug, reopen if this is still a problem.

garlick closed this as completed Jul 25, 2022