sched fails to load hwloc data from KVS at scale #1361

Closed
trws opened this issue Mar 15, 2018 · 9 comments

@trws
Member

trws commented Mar 15, 2018

With 500 brokers, the following error is produced:

2018-03-15T20:43:44.086519Z sched.err[0]: can't load hwloc data: resrc_generate_hwloc_resources: Failed to create resrc from hwloc depth 1: Success
2018-03-15T20:43:44.086548Z sched.err[0]: failed to load resrc using hwloc
2018-03-15T20:43:44.086566Z sched.err[0]: failed to load resources
2018-03-15T20:43:44.086586Z sched.crit[0]: fatal error: Success

I find it especially odd that the reported error is "Success." This is being reported here rather than there to tie to the tracking issue and because I'm not sure which end the problem is on.

@dongahn
Member

dongahn commented Mar 15, 2018

The error message is coming out of here

That's caused by this function returning an error code.

But the trace isn't enough to determine which sub-function within rsreader_hwloc_load failed. One possibility is that the hwloc XML buffer in the KVS is incomplete, though it is also possible that there is an issue with the rs2rank functions or resrc_generate_hwloc_resources.

@trws: can you add some flux_log_error calls to rsreader_hwloc_load and see which step is actually failing?

I think "Success" error is produced, because sched doesn't set an errno before exiting? Is a module supposed to set an errno before an abnormal exit?

@trws
Member Author

trws commented Mar 15, 2018

I found it, and fixed it, the reader is hard-coded to the resources it will take, and sierra has a new one. Will upload patch when time.

@dongahn
Member

dongahn commented Mar 15, 2018

Great!

@grondo
Contributor

grondo commented Mar 15, 2018

I think "Success" error is produced, because sched doesn't set an errno before exiting? Is a module supposed to set an errno before an abnormal exit?

The sched module is logging with flux_log_error, which always appends the string representation of errno. In this case the underlying function doesn't seem to set errno, so it is 0, or "Success".
(This explains the error message: Failed to create resrc from hwloc depth 1: Success)
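
To illustrate the mechanism described above, here is a minimal sketch. `format_log_error` is a hypothetical stand-in that only mimics the "append strerror(errno)" behavior (it is not the flux-core implementation), and the two failing helpers are likewise made up for illustration. On glibc, `strerror (0)` yields "Success", which is where the odd suffix comes from.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an errno-appending logger: writes
 * "<msg>: <strerror(errno)>" into buf instead of sending it to a log. */
int format_log_error (char *buf, size_t len, const char *msg)
{
    int saved_errno = errno;    /* capture errno before any other call */

    assert (buf != NULL && msg != NULL);
    return snprintf (buf, len, "%s: %s", msg, strerror (saved_errno));
}

/* Hypothetical helper that fails but never sets errno. */
int fail_without_errno (void)
{
    return -1;
}

/* Hypothetical helper that sets errno before reporting failure. */
int fail_with_errno (void)
{
    errno = EINVAL;
    return -1;
}
```

If `errno` is 0 when `fail_without_errno ()` returns -1, the formatted message ends with the text for errno 0. After `fail_with_errno ()`, the same call ends with the text for `EINVAL` instead.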

And yes, if the module's main() function returns with return code < 0 then flux-core assumes errno is set and issues the critical error message:

        flux_log (p->h, LOG_CRIT, "fatal error: %s", strerror (errno));

This should probably be rethought eventually: an errno is not going to carry enough detail to be of any use, so flux-core should just log that the module exited with nonzero status and let the module log a more detailed reason why.

@garlick
Member

garlick commented Apr 9, 2018

I found it, and fixed it, the reader is hard-coded to the resources it will take, and sierra has a new one. Will upload patch when time.

Pointer (or cut & paste) of patch would be appreciated so we can fix this one on master.

@trws
Member Author

trws commented Apr 9, 2018

It's right here over in sched. The "group" type was unhandled. If possible it would be really good to switch this over to something more generic like what resource.c uses, but this gets it done for sierra.
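
As a rough sketch of the failure mode described here (not the actual sched patch): a reader with a hard-coded list of object kinds fails on any kind it does not list. The enum, the type names, and `kind_to_resrc_type` below are all hypothetical and are not hwloc's or sched's real identifiers. The sketch adds a "group" case and makes the fallback branch set errno, so that a caller logging the failure reports something more useful than "Success".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical object kinds; these are NOT hwloc's real enum values. */
typedef enum {
    KIND_MACHINE,
    KIND_SOCKET,
    KIND_CORE,
    KIND_GROUP,
    KIND_UNKNOWN
} kind_t;

/* Map a kind to a resource type name.
 * Returns 0 and stores the name on success.
 * Returns -1 with errno set to EINVAL for kinds the table does not cover. */
int kind_to_resrc_type (kind_t kind, const char **name)
{
    assert (name != NULL);

    switch (kind) {
        case KIND_MACHINE: *name = "node";   return 0;
        case KIND_SOCKET:  *name = "socket"; return 0;
        case KIND_CORE:    *name = "core";   return 0;
        case KIND_GROUP:   *name = "group";  return 0;  /* the newly handled kind */
        default:
            errno = EINVAL;     /* make the failure visible to errno-based loggers */
            *name = NULL;
            return -1;
    }
}
```

A more generic reader would avoid the fixed table entirely, as suggested above. The sketch only shows how a hard-coded one can at least fail with a meaningful errno.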

@dongahn
Member

dongahn commented Apr 9, 2018

It seems this should go into resrc. As part of this, it would be good to take a representative hwloc XML file from sierra and add it to our tests. I will see if I can do a quick PR before moving to another project.

@garlick
Member

garlick commented Apr 9, 2018

Closing - reopened as flux-framework/flux-sched#308

@garlick garlick closed this as completed Apr 9, 2018
@dongahn
Member

dongahn commented Apr 9, 2018

@trws. And yes, hwloc reader support should go into the resource layer.
