Add infrastructure to parse scheduler output for `CalcJobs` #3906

sphuber · 2020-04-07T17:14:13Z

Add a new method Scheduler.parse_output that takes three arguments:
detailed_job_info, stdout and stderr, which are the dictionary
returned by Scheduler.get_detailed_job_info and the content of
scheduler stdout and stderr files from the repository, respectively.

A scheduler plugin can implement this method to parse the content of
these data sources to detect standard scheduler problems such as node
failures and out of memory errors. If such an error is detected, the
method can return an ExitCode that should be defined on the
calculation job class. The CalcJob base class already defines certain
exit codes for common errors, such as an out of memory error.

If the detailed job info, stdout and stderr from the scheduler output
are available after the job has been retrieved, and the scheduler plugin
that is used has implemented parse_output, it will be called by the
CalcJob.parse method. If an exit code is returned, it is set on the
corresponding node and a warning is logged. Subsequently, the normal
output parser is called, if any was defined in the inputs, which can
then of course check the node for the presence of an exit code. It then
has the opportunity to parse the retrieved output files, if any, to try
and determine are more specific error code, if applicable. Returning an
exit code from the output parser will override the exit code set by the
scheduler parser. This is why that exit code is also logged as a warning
so that the information is not completely lost.

sphuber · 2020-04-07T17:18:30Z

@espenfl on top of this PR, I have a working branch that implements the new Scheduler.parse_stdout for the SlurmScheduler. It currently only implements parsing for the OOM error, but if we can verify that this works, it should be relatively easy to extend to other standard errors we want to detect. If you have the time, could you try and give this a run, to see if this would work.

I think this would then also fully address the open issue and supersede PR #3261 and PR #3647 or is there still some functionality in there that would not be covered with this PR? Some of the functionality in your PRs actual is already present in aiida-core now, such as retrieving the detailed information after the job is retrieved.

codecov · 2020-04-07T17:27:29Z

Codecov Report

Merging #3906 into develop will increase coverage by 0.01%.
The diff coverage is 77.93%.

@@             Coverage Diff             @@
##           develop    #3906      +/-   ##
===========================================
+ Coverage    79.08%   79.09%   +0.01%     
===========================================
  Files          468      468              
  Lines        34622    34642      +20     
===========================================
+ Hits         27377    27395      +18     
- Misses        7245     7247       +2

Flag	Coverage Δ
#django	`72.71% <77.93%> (+0.01%)`	⬆️
#sqlalchemy	`71.90% <77.93%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
aiida/backends/testimplbase.py	`84.85% <ø> (ø)`
aiida/calculations/arithmetic/add.py	`100.00% <ø> (ø)`
aiida/calculations/templatereplacer.py	`25.93% <ø> (-0.90%)`	⬇️
aiida/engine/daemon/execmanager.py	`61.48% <ø> (-1.07%)`	⬇️
aiida/parsers/plugins/templatereplacer/doubler.py	`17.08% <0.00%> (-4.20%)`	⬇️
aiida/engine/processes/calcjobs/calcjob.py	`81.69% <78.09%> (-0.32%)`	⬇️
aiida/parsers/plugins/arithmetic/add.py	`95.46% <100.00%> (+10.84%)`	⬆️
aiida/schedulers/scheduler.py	`73.13% <100.00%> (+0.35%)`	⬆️
aiida/schedulers/plugins/pbsbaseclasses.py	`56.87% <0.00%> (-2.28%)`	⬇️
aiida/parsers/parser.py	`85.72% <0.00%> (-1.78%)`	⬇️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 477fe30...8562f5e. Read the comment docs.

ltalirz · 2020-04-17T16:28:36Z

Thanks a lot @sphuber , I think this is a crucial addition to the scheduler handling and will be very useful!

A scheduler plugin can implement this method to parse the content of
these data sources to detect standard scheduler problems such as node
failures and out of memory errors. If such an error is detected, the
method can return an ExitCode that should be defined on the
calculation job class. The CalcJob base class already defines certain
exit codes for common errors, such as an out of memory error.

Just to make sure I understand you here: while, technically, a scheduler plugin could return any exit code that is recognized by a custom calcjob class, I guess we should strongly advise to only return ExitCodes defined on the calcjob base class.

I think the approach chosen by you is very sensible (as opposed to, e.g., build some machinery that would allow scheduler plugins to define their own exit codes), since it gives AiiDA users the ability to deal with scheduler errors in a more abstract way.
If they want to see the specific information from the scheduler plugin, they can look for the Warning in the logs.

If an exit code is returned, it is set on the
corresponding node and a warning is logged. Subsequently, the normal
output parser is called,

While I get that this approach is the most flexible one, and should probably be supported, it does mean extra work for developers of existing plugins.

As the developer of an existing plugin, I would much prefer to see an exit code referring to an "OOM" rather than some generic "file not found / parsing failed" error.
With the approach above, I would need to modify my parse function in order to take advantage of the new functionality.

What if, instead, there was a flag EXIT_ON_SCHEDULER_ERROR (or similar) on the CalcJob class that is True by default and would skip the regular parsing, if the scheduler parser would return a non-zero exit code?
Developers who want to process calculations even in the event of a scheduler error, can than simply switch that flag.

Of course one could think of more flexible ways than a flag (a whitelist / blacklist of scheduler errors); perhaps this would be too much.

sphuber · 2020-04-17T18:32:30Z

Thanks for the comments @ltalirz

Just to make sure I understand you here: while, technically, a scheduler plugin could return any exit code that is recognized by a custom calcjob class, I guess we should strongly advise to only return ExitCodes defined on the calcjob base class.

Yes, there is no check as of yet that any exit code returned is one that we define, but even if we did, there is no way we can check that what is returned is actually correct. I think therefore checking does not matter and it is in everyone's interest to have all scheduler plugins behave as coherent as possible. We already control a big part of them through aiida-core anyway so I don't think this will pose a problem.

While I get that this approach is the most flexible one, and should probably be supported, it does mean extra work for developers of existing plugins.
As the developer of an existing plugin, I would much prefer to see an exit code referring to an "OOM" rather than some generic "file not found / parsing failed" error.

I don't think this is so much a problem for plugin developers (as long as they are aware of it) as for the users of it, but yes I see your point.

With the approach above, I would need to modify my parse function in order to take advantage of the new functionality.

This is absolutely true, albeit not too much work. Adding the following at the top would suffice:

if self.node.exit_status:
    return ExitCode(self.node.exit_status, self.node.exit_message)

I would probably provide a utility method on the CalcJobNode that will allow the last line to be replaced by return self.node.exit_code

The biggest reason why I have opted for this approach now, is that this is backwards compatible, while your proposed solution is not. Now it would be a good idea to present these two options to the mailing list to see what users and developers would prefer.

For a kick-off: @chrisjsewell @greschd @espenfl and @giovannipizzi what do you think about these two possibilities? Exit on scheduler exit code by default or not?

espenfl · 2020-04-17T19:02:40Z

@espenfl on top of this PR, I have a working branch that implements the new Scheduler.parse_stdout for the SlurmScheduler. It currently only implements parsing for the OOM error, but if we can verify that this works, it should be relatively easy to extend to other standard errors we want to detect. If you have the time, could you try and give this a run, to see if this would work.

I think this would then also fully address the open issue and supersede PR #3261 and PR #3647 or is there still some functionality in there that would not be covered with this PR? Some of the functionality in your PRs actual is already present in aiida-core now, such as retrieving the detailed information after the job is retrieved.

Yes, I will have a look at this and suggest improvements if need be. Thanks.

espenfl · 2020-04-17T19:13:54Z

For a kick-off: @chrisjsewell @greschd @espenfl and @giovannipizzi what do you think about these two possibilities? Exit on scheduler exit code by default or not?

Thanks @sphuber for collecting all the scattered work on this into one PR and also the additional improvements. I think it makes total sense to integrate it with the current framework for exit codes.

When it comes to the flow of this I would certainly advice that we at the parsing step have the posibility to know what kind of exit code was returned from the scheduler. Sometimes you might need to parse a different file, check some extra parameters before loading data into a node or similar. This also means we should not close shop once we see an exit code from the scheduler. We should continue with the parsing step as I believe there are plenty of scheduling errors that can appear, but can leave the calculation salvageable, or in a state where we need more info from the parser to act. In the end this might result in a different final exit code that is returned or no exit code at all.

espenfl · 2020-04-17T19:22:21Z

Another issue we should think about already now is; can we call this machinery from a transport monitor? At some point we need to monitor the job info, scheduler stdout/stderr and code stdout/stderr while it is running. It would be nice if we could reuse this, or rewrite it now so that it is general enough to be used in such a context.

ltalirz · 2020-04-17T19:24:39Z

This is absolutely true, albeit not too much work.

That's clear - however with 49 plugins registered, any change we require from plugin developers will take a significant amount of time to propagate (and some otherwise working plugins may never see it).

When it comes to the flow of this I would certainly advice that we at the parsing step have the posibility to know what kind of exit code was returned from the scheduler.

@espenfl I think we anyhow all agree that this must me possible. The question is: what should happen by default?

Exiting on non-zero scheduler exit codes would have the advantage that all plugins will benefit from the functionality automatically - however, as @sphuber mentions this is is a change of behavior.

Do any of you guys have an existing example in mind, where this change in behavior would be detrimental?

ltalirz · 2020-04-17T19:31:21Z

Thinking more about this, there is even an alternative route, where the switch to "exit immediately" exists but is turned off by default.
This could give some time to experiment with turning it on in a couple of plugins (I would most likely do so in mine)

Once we feel it is safe, we can then still change the default value in a later release.

espenfl · 2020-04-17T19:38:19Z

@espenfl I think we anyhow all agree that this must me possible. The question is: what should happen by default?

Exiting on non-zero scheduler exit codes would have the advantage that all plugins will benefit from the functionality automatically - however, as @sphuber mentions this is is a change of behavior.

This is also what I think. Adding a default to break on non-zero exit codes in case developers have not added other functionality is maybe formally not so easy. If the exit codes and whatever we chose to implement in the scheduler plugins is not overly aggressive this is easy (say OOM and walltime). In the end I think we will end up with a rather extensive set of exit codes, some which are not that critical, possibly even just a scheduler error, but the calculation finished just fine. The scheduler stdout/stderr will house a lot of things which might not linked to the scheduler, say MPI stuff. We might get errors here, which are not critical, but that we want to know about and possibly act on, say by adjusting the node/cpu spread or whatever. This would be a use case. Probably many. We can of course argue that these type of exit codes should maybe formally not be on the scheduler, but right now, we have no other place to put it.

But it is a while before we reach this and in the meantime I think most plugin developers would like the benefits from OOM and walltime etc. So I guess in practical terms this is more easy.

mbercx

Thanks for all the work, @sphuber! I just have one question/suggestion and spotted a typo.

aiida/engine/processes/calcjobs/calcjob.py

mbercx · 2020-08-26T12:07:07Z

Regarding the discussion on whether to exit immediately if the Scheduler.parse_output method returns an exit code, I'd also vote for not exiting by default. I like @ltalirz's suggestion of adding a switch to the CalcJob class (as a class variable, e.g. _return_scheduler_exitcode?) that is set to False by default. This way the plugin developers can simply set this to True for their calculation jobs in case they want to return the scheduler exit code.

sphuber · 2020-08-26T12:52:57Z

Thanks @mbercx , I addressed the two comments and rebased on latest develop.

Regarding the discussion on whether to exit immediately if the Scheduler.parse_output method returns an exit code, I'd also vote for not exiting by default. I like @ltalirz's suggestion of adding a switch to the CalcJob class (as a class variable, e.g. _return_scheduler_exitcode?) that is set to False by default. This way the plugin developers can simply set this to True for their calculation jobs in case they want to return the scheduler exit code.

This might potentially be a useful feature, but I would not implement it now. Once we have people asking for it we can open a feature request and implement it.

ltalirz · 2020-08-26T15:40:19Z

Good for me as well to skip adding this option for the time being.

sphuber · 2020-08-26T15:55:37Z

I just realized a subtlety that we might want to change. If the parser does not return a specific error code, but just None, which technically currently corresponds to signalling that everything went fine and so is the same as returning ExitCode(0), should this also override any potential error codes set from the scheduler or not? Currently that is what it does. Although this is technically consistent with current functionality, it may be undesirable default behavior. To keep the scheduler exit code, one would have to manually return it, i.e return self.node.exit_code. Otherwise, we would have to change the meaning of returning None and returning ExitCode(0) from a parser. The latter would override any scheduler exit codes, if any, but the former would not. What do we do?

mbercx · 2020-08-26T16:46:16Z

I just realized a subtlety that we might want to change. If the parser does not return a specific error code, but just None, which technically currently corresponds to signalling that everything went fine and so is the same as returning ExitCode(0), should this also override any potential error codes set from the scheduler or not? Currently that is what it does.

Hmm, good point. I'd say that in the (not very common?) case that the calculation parser doesn't return an exit code and the scheduler does, we should return the scheduler exit code. Else the user won't see that there was an issue unless he/she checks the logs.

Otherwise, we would have to change the meaning of returning None and returning ExitCode(0) from a parser.

I'm not sure what you mean here... 😅

espenfl · 2020-08-26T18:18:56Z

I just realized a subtlety that we might want to change. If the parser does not return a specific error code, but just None, which technically currently corresponds to signalling that everything went fine and so is the same as returning ExitCode(0), should this also override any potential error codes set from the scheduler or not? Currently that is what it does. Although this is technically consistent with current functionality, it may be undesirable default behavior. To keep the scheduler exit code, one would have to manually return it, i.e return self.node.exit_code. Otherwise, we would have to change the meaning of returning None and returning ExitCode(0) from a parser. The latter would override any scheduler exit codes, if any, but the former would not. What do we do?

Thanks for picking up on this again. What I would expect is that the exit code penetrates, regardless of source. If the scheduler sets an exit code I would expect it to be rather serious (as some do not for many errors that are relevant). If we do a parse and detect nothing I would trust the exit code more than the parsing, at least from a programmatically perspective. Also, I would expect it is more common to change the formatting of the text than the exit code. So I would make sure the exit code from the parser is what sticks, unless you manually override it during parsing.

sphuber · 2020-08-26T20:22:07Z

Yeah, I thought about it some more and the current behavior is not what we want. For example, if there is an OOM, but the parser doesn't check or doesn't notice and so returns None, which is the default behavior, the OOM exit code of the scheduler will be overwritten by ExitCode(0) because currently, returning nothing from a parser means everything is ok. With this behavior many basic parsers will therefore completely override scheduler exit codes, which is certainly not what we want. Advanced parsers that would anyway detect the problem and provide another exit code, might therefore override the scheduler one, but at least they won't be marked as 0. So I will adapt this PR first before we merge it.

Add a new method `Scheduler.parse_output` that takes three arguments: `detailed_job_info`, `stdout` and `stderr`, which are the dictionary returned by `Scheduler.get_detailed_job_info` and the content of scheduler stdout and stderr files from the repository, respectively. A scheduler plugin can implement this method to parse the content of these data sources to detect standard scheduler problems such as node failures and out of memory errors. If such an error is detected, the method can return an `ExitCode` that should be defined on the calculation job class. The `CalcJob` base class already defines certain exit codes for common errors, such as an out of memory error. If the detailed job info, stdout and stderr from the scheduler output are available after the job has been retrieved, and the scheduler plugin that is used has implemented `parse_output`, it will be called by the `CalcJob.parse` method. If an exit code is returned, it is set on the corresponding node and a warning is logged. Subsequently, the normal output parser is called, if any was defined in the inputs, which can then of course check the node for the presence of an exit code. It then has the opportunity to parse the retrieved output files, if any, to try and determine a more specific error code, if applicable. Returning an exit code from the output parser will override the exit code set by the scheduler parser. This is why that exit code is also logged as a warning so that the information is not completely lost. This choice does change the old behavior when an output parser would return `None` which would be interpreted as `ExitCode(0)`. However, now if the scheduler parser returned an exit code, it will not be overridden by the `None` of the output parser, which is then essentially ignored. This is necessary, because otherwise, basic parsers that don't return anything even if an error might have occurred will always just override the scheduler exit code, which is not desirable.

The `ERROR_NO_RETRIEVED_FOLDER` is now defined on the `CalcJob` base class and the `CalcJob.parse` method already checks for the presence of the retrieved folder and return the exit code if it is missing. This allows us to remove the similar exit codes that are currently defined on the calculation plugins shipped with `aiida-core` `ArithmeticAddCalculation` and `TemplateReplacerCalculation` as well as the check for the presence of the `retrieved` output from the corresponding parsers. The fact that is now checked in the `CalcJob` base class means that `Parser` implementations can assume safely that the retrieved output node exists.

sphuber · 2020-08-27T11:12:48Z

@mbercx and @espenfl I have updated the code with the new behavior. I have added a test for it and added some documentation to explain the rules around exit codes. Please have a look at those changes to see if they look okay.

mbercx · 2020-08-27T13:20:40Z

Great work, @sphuber! I've run a few tests for the slurm scheduler (after applying fe31e42438ba770a1c5363c2e6cc6d0ed265bc7a to this PR's branch) with aiida-quantumespresso/aiida-sirius and the current code exactly reproduces the desired behaviour.

I have noticed that the NLCGParser from aiida-sirius specifically returns ExitCode(0) if it encounters no issues. This may be the case for the parser of other plugins, so we should probably highlight this new feature for the plugin developers when we release it.

Other than that, I have no further comments, so this is ready to go for me.

greschd · 2020-08-28T09:31:24Z

aiida/engine/processes/calcjobs/calcjob.py

@@ -192,6 +193,14 @@ def define(cls, spec: CalcJobProcessSpec):
            help='Files that are retrieved by the daemon will be stored in this node. By default the stdout and stderr '
                 'of the scheduler will be added, but one can add more by specifying them in `CalcInfo.retrieve_list`.')

+        # Errors caused or returned by the scheduler


@sphuber sorry only saw this now: should some (all?) off these exit codes set invalidates_cache=True?

Good point, I think they probably should. Will make a PR to correct it, we are now testing the implementation of the scheduler output parser for the SLURM plugin

Yeah.. I'm pretty sure about the 100 error. For 110 and 120, I think we should check if the inputs related to how much memory / walltime is requested go into the caching hash.
If they do not, invalidates_cache=True seems right to me. If they do, False is probably right - because the job is unlikely to succeed with the same resources.

sphuber requested review from chrisjsewell, greschd, giovannipizzi and espenfl April 7, 2020 17:14

sphuber force-pushed the feature/2431/scheduler-output-parsing branch 5 times, most recently from 1d3d633 to b860d2a Compare April 14, 2020 13:40

This was referenced Apr 14, 2020

Implement Scheduler.parse_output for the SlurmScheduler plugin #3931

Merged

WIP: Restructured the detailed job info from SLURM #3261

Closed

Initial prototype for scheduler parsing #3647

Closed

sphuber force-pushed the feature/2431/scheduler-output-parsing branch 2 times, most recently from 27004f4 to c3a9406 Compare April 15, 2020 21:23

sphuber force-pushed the feature/2431/scheduler-output-parsing branch from c3a9406 to 47cbdbc Compare April 22, 2020 09:05

sphuber added the pr/ready-for-review PR is ready to be reviewed label Jun 4, 2020

sphuber force-pushed the feature/2431/scheduler-output-parsing branch from 47cbdbc to 98d7881 Compare June 7, 2020 17:23

sphuber force-pushed the feature/2431/scheduler-output-parsing branch from 98d7881 to 05603e0 Compare June 23, 2020 15:54

mbercx requested changes Aug 26, 2020

View reviewed changes

aiida/engine/processes/calcjobs/calcjob.py Outdated Show resolved Hide resolved

aiida/engine/processes/calcjobs/calcjob.py Outdated Show resolved Hide resolved

sphuber force-pushed the feature/2431/scheduler-output-parsing branch 2 times, most recently from 33169b0 to 3191858 Compare August 26, 2020 12:48

sphuber force-pushed the feature/2431/scheduler-output-parsing branch from 3191858 to ed98ba4 Compare August 27, 2020 08:53

sphuber added 2 commits August 27, 2020 13:11

sphuber force-pushed the feature/2431/scheduler-output-parsing branch from ed98ba4 to 8562f5e Compare August 27, 2020 11:11

mbercx self-requested a review August 27, 2020 13:20

mbercx approved these changes Aug 27, 2020

View reviewed changes

sphuber merged commit f23078e into aiidateam:develop Aug 27, 2020

sphuber deleted the feature/2431/scheduler-output-parsing branch August 27, 2020 13:30

This was referenced Aug 27, 2020

Double check calculation job really has finished from detailed scheduler info when normal scheduler update says job is terminated #2431

Open

Implement basic infrastructure to allow scheduler plugins to parse their output #4331

Closed

greschd reviewed Aug 28, 2020

View reviewed changes

sphuber mentioned this pull request Aug 28, 2020

Determine which built in exit codes for CalcJob have to have set invalidates_cache=True #4338

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add infrastructure to parse scheduler output for `CalcJobs` #3906

Add infrastructure to parse scheduler output for `CalcJobs` #3906

sphuber commented Apr 7, 2020 •

edited

Loading

sphuber commented Apr 7, 2020

codecov bot commented Apr 7, 2020 •

edited

Loading

ltalirz commented Apr 17, 2020 •

edited

Loading

sphuber commented Apr 17, 2020

espenfl commented Apr 17, 2020

espenfl commented Apr 17, 2020

espenfl commented Apr 17, 2020

ltalirz commented Apr 17, 2020 •

edited

Loading

ltalirz commented Apr 17, 2020 •

edited

Loading

espenfl commented Apr 17, 2020 •

edited

Loading

mbercx left a comment •

edited

Loading

mbercx commented Aug 26, 2020

sphuber commented Aug 26, 2020

ltalirz commented Aug 26, 2020

sphuber commented Aug 26, 2020

mbercx commented Aug 26, 2020

espenfl commented Aug 26, 2020

sphuber commented Aug 26, 2020

sphuber commented Aug 27, 2020

mbercx commented Aug 27, 2020

greschd Aug 28, 2020

sphuber Aug 28, 2020

greschd Aug 28, 2020

Add infrastructure to parse scheduler output for CalcJobs #3906

Add infrastructure to parse scheduler output for CalcJobs #3906

Conversation

sphuber commented Apr 7, 2020 • edited Loading

sphuber commented Apr 7, 2020

codecov bot commented Apr 7, 2020 • edited Loading

Codecov Report

ltalirz commented Apr 17, 2020 • edited Loading

sphuber commented Apr 17, 2020

espenfl commented Apr 17, 2020

espenfl commented Apr 17, 2020

espenfl commented Apr 17, 2020

ltalirz commented Apr 17, 2020 • edited Loading

ltalirz commented Apr 17, 2020 • edited Loading

espenfl commented Apr 17, 2020 • edited Loading

mbercx left a comment • edited Loading

Choose a reason for hiding this comment

mbercx commented Aug 26, 2020

sphuber commented Aug 26, 2020

ltalirz commented Aug 26, 2020

sphuber commented Aug 26, 2020

mbercx commented Aug 26, 2020

espenfl commented Aug 26, 2020

sphuber commented Aug 26, 2020

sphuber commented Aug 27, 2020

mbercx commented Aug 27, 2020

greschd Aug 28, 2020

Choose a reason for hiding this comment

sphuber Aug 28, 2020

Choose a reason for hiding this comment

greschd Aug 28, 2020

Choose a reason for hiding this comment

Add infrastructure to parse scheduler output for `CalcJobs` #3906

Add infrastructure to parse scheduler output for `CalcJobs` #3906

sphuber commented Apr 7, 2020 •

edited

Loading

codecov bot commented Apr 7, 2020 •

edited

Loading

ltalirz commented Apr 17, 2020 •

edited

Loading

ltalirz commented Apr 17, 2020 •

edited

Loading

ltalirz commented Apr 17, 2020 •

edited

Loading

espenfl commented Apr 17, 2020 •

edited

Loading

mbercx left a comment •

edited

Loading