t5000-valgrind test fails on Jetson Nano #855

Closed
javawolfpack opened this issue Aug 2, 2021 · 13 comments

Comments

@javawolfpack

Pretty sure this is the same problem as this flux-core issue:

flux-framework/flux-core#3808

I'm guessing the patch for that issue needs to be tweaked for flux-sched. Let me know if you want more info from the flux-sched instance of the test.

@garlick
Member

garlick commented Aug 2, 2021

That makes sense. Want to submit a PR against flux-sched? Eyeballing it, it seems like the same patch ought to apply (same path, and same patch context AFAICT).

@javawolfpack
Author

Let me try again... but pretty sure it failed even after trying to patch with the same patch file as flux-core.

@dongahn
Member

dongahn commented Aug 2, 2021

Do you have the valgrind outputs from your failure to post?

@javawolfpack
Author

Do you have the valgrind outputs from your failure to post?

Figuring out how to run the flux-sched test on its own with verbose output was my next step if it fails again. Is there anything different from flux-core?

@garlick
Member

garlick commented Aug 2, 2021

Should be the same as flux-core: change to the t directory and run ./t5000-valgrind.t -d -v

@garlick
Member

garlick commented Aug 2, 2021

The patch just adds a new stanza to the valgrind.supp file, so it should be easy to recreate the change by hand if needed. I guess it should reference this bug number at the top instead of the core one, so it would need a little TLC anyway.

@javawolfpack
Author

Should be the same as flux-core: change to the t directory and run ./t5000-valgrind.t -d -v

$ ./t5000-valgrind.t -d -v
sharness: loading extensions from /home/user/flux-sched/t/sharness.d/flux-sharness.sh
sharness: loading extensions from /home/user/flux-sched/t/sharness.d/sched-sharness.sh
expecting success:
	run_timeout 400 \
	flux start -s ${VALGRIND_NBROKERS} \
		--killer-timeout=120 \
		--wrap=libtool,e,${VALGRIND} \
		--wrap=--tool=memcheck \
		--wrap=--leak-check=full \
		--wrap=--gen-suppressions=all \
		--wrap=--trace-children=no \
		--wrap=--child-silent-after-fork=yes \
		--wrap=--num-callers=30 \
		--wrap=--leak-resolution=med \
		--wrap=--error-exitcode=1 \
		--wrap=--suppressions=$VALGRIND_SUPPRESSIONS \
		 ${VALGRIND_WORKLOAD}

==2548309== Memcheck, a memory error detector
==2548309== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==2548309== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==2548309== Command: /usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-RJUk2d
==2548309==
==2548308== Memcheck, a memory error detector
==2548308== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==2548308== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==2548308== Command: /usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-RJUk2d /home/user/flux-sched/t/valgrind/valgrind-workload.sh
==2548308==
FLUX_URI=local:///tmp/flux-RJUk2d/local-0
not ok 1 - valgrind reports no new errors on 2 broker run
#
#		run_timeout 400 \
#		flux start -s ${VALGRIND_NBROKERS} \
#			--killer-timeout=120 \
#			--wrap=libtool,e,${VALGRIND} \
#			--wrap=--tool=memcheck \
#			--wrap=--leak-check=full \
#			--wrap=--gen-suppressions=all \
#			--wrap=--trace-children=no \
#			--wrap=--child-silent-after-fork=yes \
#			--wrap=--num-callers=30 \
#			--wrap=--leak-resolution=med \
#			--wrap=--error-exitcode=1 \
#			--wrap=--suppressions=$VALGRIND_SUPPRESSIONS \
#			 ${VALGRIND_WORKLOAD}
#

# failed 1 among 1 test(s)
1..1

That's the valgrind output after applying the same patch from flux-core. My guess is that I need to manually add those lines to the valgrind.supp file.

@javawolfpack
Author

Going to test manually adding those lines to the supp file with this issue number, and if it works I'll make a PR.

@javawolfpack
Author

javawolfpack commented Aug 2, 2021

I've added the following lines to the end of t/valgrind/valgrind.supp and it still fails under make check:

{
   <issue_855>
   Memcheck:Param
   epoll_ctl(event)
   fun:epoll_ctl
   fun:epoll_modify
   fun:fd_reify
   fun:ev_run
   ...
}

@dongahn
Copy link
Member

dongahn commented Aug 2, 2021

It is odd that we don't see the stack traces of memory errors. One possibility is that this test fails not because of a memory error but because of a timeout. Could you increase run_timeout in the test to something like 1200 and see if the test is happier?
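
A minimal sketch of how a timeout can be told apart from a genuine failure, assuming the command is wrapped in coreutils `timeout`, which exits with status 124 when the limit is hit. This is not a description of how the test's `run_timeout` helper reports things; `classify_status` and `run_and_classify` are hypothetical names used only for illustration.

```shell
# Hypothetical helper: map an exit status from coreutils `timeout`
# to a short label. Status 124 means the time limit expired.
classify_status() {
    case "$1" in
        0)   echo "pass" ;;
        124) echo "timeout" ;;
        *)   echo "error" ;;
    esac
}

# Hypothetical wrapper: run a command under a time limit (in seconds)
# and print the label for how it ended.
run_and_classify() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    classify_status "$status"
}
```

For example, `run_and_classify 1 sleep 3` would report a timeout, while a command that exits nonzero within the limit would be reported as an error.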

@javawolfpack
Copy link
Author

It is odd that we don't see the stack traces of memory errors. One possibility is that this test fails not because of a memory error but because of a timeout. Could you increase run_timeout in the test to something like 1200 and see if the test is happier?

So it passes when I manually set the run timeout to 1200.

$ ./t5000-valgrind.t -d -v
sharness: loading extensions from /home/user/flux-sched/t/sharness.d/flux-sharness.sh
sharness: loading extensions from /home/user/flux-sched/t/sharness.d/sched-sharness.sh
expecting success:
	run_timeout 1200 \
	flux start -s ${VALGRIND_NBROKERS} \
		--killer-timeout=120 \
		--wrap=libtool,e,${VALGRIND} \
		--wrap=--tool=memcheck \
		--wrap=--leak-check=full \
		--wrap=--gen-suppressions=all \
		--wrap=--trace-children=no \
		--wrap=--child-silent-after-fork=yes \
		--wrap=--num-callers=30 \
		--wrap=--leak-resolution=med \
		--wrap=--error-exitcode=1 \
		--wrap=--suppressions=$VALGRIND_SUPPRESSIONS \
		 ${VALGRIND_WORKLOAD}

==47386== Memcheck, a memory error detector
==47386== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==47386== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==47386== Command: /usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-ZYssQb
==47386==
==47385== Memcheck, a memory error detector
==47385== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==47385== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==47385== Command: /usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-ZYssQb /home/user/flux-sched/t/valgrind/valgrind-workload.sh
==47385==
FLUX_URI=local:///tmp/flux-ZYssQb/local-0
Running 00-job
Submitting 10 jobs
f56dfGEfR
f57MNn2aB
f57nCL1D1
f58Rcmsmq
f58tUvrxF
f59a9qeKR
f5AZ6x4Bh
f5BU9EMwD
f5CCSqsKu
f5CsdatLX
Waiting jobs to complete
Completed
2021-08-02T22:47:25.187652Z sched-fluxion-qmanager.err[0]: update_on_resource_response: exiting due to sched-fluxion-resource.notify failure: Operation canceled
==47386==
==47386== HEAP SUMMARY:
==47386==     in use at exit: 268,007 bytes in 3,353 blocks
==47386==   total heap usage: 107,968 allocs, 104,615 frees, 2,907,408,583 bytes allocated
==47386==
==47386== LEAK SUMMARY:
==47386==    definitely lost: 0 bytes in 0 blocks
==47386==    indirectly lost: 0 bytes in 0 blocks
==47386==      possibly lost: 0 bytes in 0 blocks
==47386==    still reachable: 267,831 bytes in 3,351 blocks
==47386==         suppressed: 176 bytes in 2 blocks
==47386== Reachable blocks (those to which a pointer was found) are not shown.
==47386== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==47386==
==47386== For lists of detected and suppressed errors, rerun with: -s
==47386== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 524 from 2)
==47385==
==47385== HEAP SUMMARY:
==47385==     in use at exit: 275,674 bytes in 3,375 blocks
==47385==   total heap usage: 583,004 allocs, 579,629 frees, 136,495,400,551 bytes allocated
==47385==
==47385== LEAK SUMMARY:
==47385==    definitely lost: 0 bytes in 0 blocks
==47385==    indirectly lost: 0 bytes in 0 blocks
==47385==      possibly lost: 0 bytes in 0 blocks
==47385==    still reachable: 275,322 bytes in 3,371 blocks
==47385==         suppressed: 352 bytes in 4 blocks
==47385== Reachable blocks (those to which a pointer was found) are not shown.
==47385== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==47385==
==47385== For lists of detected and suppressed errors, rerun with: -s
==47385== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1646 from 2)
ok 1 - valgrind reports no new errors on 2 broker run

# passed all 1 test(s)
1..1
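
The passing run above ends with one `ERROR SUMMARY` line per broker, while the earlier failing run printed none. A rough sketch of checking a saved log for that difference follows, assuming the test output was redirected to a file. The function name `valgrind_log_clean` is hypothetical, and treating a missing summary as "the run was cut short" is an interpretation, not something valgrind itself reports.

```shell
# Hypothetical helper: succeed only if the log contains at least one
# valgrind "ERROR SUMMARY" line and every such line reports 0 errors.
valgrind_log_clean() {
    log=$1
    # No summary at all: valgrind never got to print its final report.
    grep -q 'ERROR SUMMARY:' "$log" || return 1
    # Any summary line that does not say "0 errors" counts as a failure.
    if grep 'ERROR SUMMARY:' "$log" | grep -qv 'ERROR SUMMARY: 0 errors'; then
        return 1
    fi
    return 0
}
```

With a log captured from the run above, `valgrind_log_clean run.log && echo clean` would print `clean`; a log from the 400-second run, which has no summary lines, would not.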

@dongahn
Member

dongahn commented Aug 4, 2021

Is there anything else that needs to be done before closing it?

@javawolfpack
Author

No. Since extending the timeout solved it, I guess that's it. I'm curious whether that would solve the flux-core issue too.
