not ok 6 - no jobs received alloc-check exception #1084

Closed
garlick opened this issue Sep 28, 2023 · 4 comments

@garlick
Member

garlick commented Sep 28, 2023

It's been a while since I built flux on this Ubuntu 22.04 LTS desktop, but I am reliably hitting this test failure (on @grondo's make-deb-fix branch for #1082):

30/85 Test #30: t1024-alloc-check.t ...................***Failed   22.05 sec
expecting success: 
    load_test_resources ${excl_1N1B}

{"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0-15"}}],"starttime":0.0,"expiration":0.0,"nodelist":["cab1234"]}}
ok 1 - load test resources

expecting success: 
	load_resource &&
	load_qmanager_sync

{
 "tx": {
  "request": 14,
  "response": 0,
  "event": 0,
  "control": 0
 },
 "rx": {
  "request": 1,
  "response": 5,
  "event": 0,
  "control": 0
 }
}
ok 2 - load fluxion modules

expecting success: 
	flux config load <<-EOT &&
	[job-manager]
	epilog.command = [ "flux", "perilog-run", "epilog", "-e", "sleep,2" ]
	EOT
	flux jobtap load perilog.so

ok 3 - configure epilog with delay

expecting success: 
	(for i in $(seq 5); do \
	    flux run -N1 -x -t1s sleep 30 || true; \
	done) 2>joberr

ok 4 - submit node-exclusive jobs that exceed their time limit

expecting success: 
	grep "job.exception type=timeout" joberr

1.029s: job.exception type=timeout severity=0 resource allocation expired
1.027s: job.exception type=timeout severity=0 resource allocation expired
1.061s: job.exception type=timeout severity=0 resource allocation expired
1.030s: job.exception type=timeout severity=0 resource allocation expired
1.028s: job.exception type=timeout severity=0 resource allocation expired
ok 5 - some jobs received timeout exception

expecting success: 
	test_must_fail grep "job.exception type=alloc-check" joberr

0.023s: job.exception type=alloc-check severity=0 resources already allocated
0.057s: job.exception type=alloc-check severity=0 resources already allocated
test_must_fail: command succeeded: grep job.exception type=alloc-check joberr
not ok 6 - no jobs received alloc-check exception
#	
#		test_must_fail grep "job.exception type=alloc-check" joberr
#	

expecting success: 
	flux job cancelall -f &&
	flux queue idle &&
	(flux resource undrain 0 || true)

flux-job: Canceled 2 jobs (0 errors)
0 jobs
flux-resource: ERROR: rank 0 not drained
ok 7 - clean up

expecting success: 
	(for i in $(seq 10); do \
	    flux run --ntasks=1 --cores-per-task=8 -t1s sleep 30 || true; \
	done) 2>joberr2

ok 8 - submit non-exclusive jobs that exceed their time limit

expecting success: 
	grep "job.exception type=timeout" joberr2

1.029s: job.exception type=timeout severity=0 resource allocation expired
1.028s: job.exception type=timeout severity=0 resource allocation expired
1.026s: job.exception type=timeout severity=0 resource allocation expired
1.027s: job.exception type=timeout severity=0 resource allocation expired
1.027s: job.exception type=timeout severity=0 resource allocation expired
1.027s: job.exception type=timeout severity=0 resource allocation expired
1.029s: job.exception type=timeout severity=0 resource allocation expired
1.027s: job.exception type=timeout severity=0 resource allocation expired
1.028s: job.exception type=timeout severity=0 resource allocation expired
1.028s: job.exception type=timeout severity=0 resource allocation expired
ok 9 - some jobs received timeout exception

expecting success: 
	test_must_fail grep "job.exception type=alloc-check" joberr2

0.024s: job.exception type=alloc-check severity=0 resources already allocated
0.023s: job.exception type=alloc-check severity=0 resources already allocated
test_must_fail: command succeeded: grep job.exception type=alloc-check joberr2
not ok 10 - no jobs received alloc-check exception
#	
#		test_must_fail grep "job.exception type=alloc-check" joberr2
#	

expecting success: 
	cleanup_active_jobs

Scheduling is stopped
flux-job: Canceled 2 jobs (0 errors)
0 jobs
ok 11 - clean up

expecting success: 
	remove_qmanager &&
	remove_resource

ok 12 - remove fluxion modules

# failed 2 among 12 test(s)
1..12
Sep 28 19:19:10.826176 broker.err[0]: rc2.0: sh /home/garlick/proj/flux-sched/t/t1024-alloc-check.t  --verbose Exited (rc=1) 21.4s
flux-start: 0 (pid 3410903) exited with rc=1
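
For readers following the log: tests 6 and 10 pass only when grep finds no `alloc-check` exception in the captured stderr, which is why the two matching lines make them fail. A minimal sketch of that check, assuming a hypothetical helper name `no_alloc_check_exception` (not part of the test suite) and sample lines copied from the log above:

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: succeeds when the captured
# stderr file contains no alloc-check exception.
no_alloc_check_exception() {
    # grep -q exits 0 on a match, so the result is inverted.
    ! grep -q "job.exception type=alloc-check" "$1"
}

# Sample data copied from the log above, written to a scratch file.
joberr_sample=$(mktemp)
cat >"$joberr_sample" <<'EOF'
1.029s: job.exception type=timeout severity=0 resource allocation expired
0.023s: job.exception type=alloc-check severity=0 resources already allocated
EOF

if no_alloc_check_exception "$joberr_sample"; then
    echo "ok: no alloc-check exception"
else
    echo "not ok: alloc-check exception found"
fi
rm -f "$joberr_sample"
```

With the sample above the sketch reports the exception as found, mirroring the `not ok 6` result in the log.
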
@trws
Member

trws commented Oct 19, 2023

Is this one still happening? I thought this got fixed as part of one of the PRs during that flurry a few weeks ago.

@garlick
Member Author

garlick commented Oct 19, 2023

Just ran that test 10x in a row on the same system where it was consistently failing before and no failures this time.

If it was intentionally fixed, it was not mentioned in any commit message (that I can find anyway).
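
For anyone wanting to repeat this kind of check on an intermittent failure, a rough sketch of a rerun loop follows. The helper name `count_failures` is hypothetical, and `true`/`false` are stand-ins so the sketch runs anywhere; in a build tree the command would be whatever invokes t1024-alloc-check.t.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and print how many runs failed.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard output; only the exit status matters here.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# Stand-in commands (assumption): replace with the real test invocation.
echo "failures with 'true':  $(count_failures 10 true)"
echo "failures with 'false': $(count_failures 10 false)"
```

A count of zero over ten runs does not prove the bug is gone, only that it did not show up in those runs.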

@trws
Member

trws commented Oct 19, 2023

Looks like you're right, I was thinking of something else. It makes me a bit wary about this. I haven't been able to reproduce the failure though so I'm not sure where to go with it. Maybe close pending a reappearance?

@garlick
Member Author

garlick commented Oct 19, 2023

Yeah sounds good to me.

garlick closed this as completed Oct 19, 2023