
Limit caching to successful calculations? #3368

Closed
greschd opened this issue Oct 1, 2019 · 12 comments · Fixed by #3637

Comments

@greschd
Member

greschd commented Oct 1, 2019

In the previous AiiDA version, caching checked that a calculation was in the FINISHED state before using it as a cache source -- the purpose being to allow failed calculations to be re-run.

With the updates to 1.0, the concept of calculation states was overhauled, and now a calculation can be finished even if it failed (with non-zero exit codes).

I believe the caching implementation should be adapted so that it's only triggered when the exit status is zero.
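
For concreteness, the proposed rule amounts to something like the following sketch (a hypothetical standalone check, not the actual aiida-core implementation):

```python
def node_is_valid_cache_source(node):
    """Hypothetical check: only reuse a calculation that finished
    with exit status zero, i.e. a genuine success."""
    # `is_finished` is True for FINISHED nodes regardless of exit status;
    # requiring `exit_status == 0` excludes the soft-failure cases.
    # (aiida-core also exposes this combination as `is_finished_ok`.)
    return node.is_finished and node.exit_status == 0
```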

@sphuber @giovannipizzi do you agree with this?

@sphuber
Contributor

sphuber commented Oct 1, 2019

We actually considered this problem and decided to also consider failed calculations cacheable. The use case comes from developing workchains: caching is really useful for not having to repeat the calculations themselves when rerunning a workchain. However, often you want to test the error handling, which is only triggered by "failed" calculations; excluding them from caching forces them to be rerun. The difference with the old system is that the old FAILED state covered all problems: not just soft failures flagged by the parser (convergence not reached) but also plain exceptions. The new "failed" concept of FINISHED + non-zero exit status is a lot softer and doesn't necessarily indicate a real failure.

@greschd
Member Author

greschd commented Oct 1, 2019

Yeah, but it might. The reason I've encountered this: I fixed some intermittent issue on the computer, after which the calculation still failed because the result was taken from the cache. I don't know how to make this calculation usable again with the current setup.

In my opinion, being cautious with caching is the preferable approach -- i.e., only cache when you're certain the calculation consistently behaves that way.

@greschd
Member Author

greschd commented Oct 1, 2019

As a plugin developer, you can always override the is_valid_cache property to, for example, allow certain "known good" error codes.
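
A sketch of what such an override could look like, assuming one could register a custom node class (which, as the following comments point out, plugin developers actually cannot do; all names and code values here are illustrative):

```python
from aiida.orm import CalcJobNode

class MyCodeCalcJobNode(CalcJobNode):
    """Hypothetical node subclass illustrating the suggestion above."""

    # Exit codes considered deterministic and therefore safe to reuse
    # from the cache (the values here are made up for illustration).
    _CACHEABLE_EXIT_CODES = {0, 410}

    @property
    def is_valid_cache(self):
        # Keep all base checks (hash present, caching enabled, ...) and
        # additionally require a "known good" exit code.
        return super().is_valid_cache and self.exit_status in self._CACHEABLE_EXIT_CODES
```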

@greschd
Member Author

greschd commented Oct 1, 2019

In https://github.com/aiidateam/aiida-quantumespresso/blob/develop/aiida_quantumespresso/calculations/pw.py, for example, the 3xx error codes are the kind that should probably not be cached, while 4xx and 5xx could be.
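
In other words, a range-based convention along these lines (a hypothetical helper, not part of aiida-quantumespresso):

```python
def pw_exit_status_is_cacheable(exit_status):
    """Hypothetical classification of PwCalculation exit codes:
    3xx codes signal problems with the run environment or output
    retrieval and should be retried, while 4xx/5xx codes describe
    deterministic outcomes of the calculation itself."""
    if exit_status == 0:
        return True
    return 400 <= exit_status < 600  # 4xx and 5xx: cacheable; 3xx: not
```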

@sphuber
Contributor

sphuber commented Oct 1, 2019

> Yeah, but it might. The reason I've encountered this: I fixed some intermittent issue on the computer, after which the calculation still failed because the result was taken from the cache. I don't know how to make this calculation usable again with the current setup.

You can remove the hash extra from the calculation that you don't want to be used for caching. Not an ideal mechanism, but it is currently the only way to "disable" individual nodes from being considered. Or is that not what you were asking?
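
Concretely, the manual workaround looks roughly like this (the extra key '_aiida_hash' is what aiida-core 1.x uses for the node hash, but verify against your version):

```python
from aiida.orm import load_node

node = load_node(1234)  # pk of the calculation you no longer want matched
# Deleting the hash extra means cache lookups will never match this node.
node.delete_extra('_aiida_hash')
```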

> In my opinion, being cautious with caching is the preferable approach -- i.e., only cache when you're certain the calculation consistently behaves that way.

I see, fair enough, but I still think this is going to be a big limiting factor for the workchain-development use case. Without caching, debugging workchains is not really feasible.

> As a plugin developer, you can always override the is_valid_cache property to, for example, allow certain "known good" error codes.

Except the problem is that this is a Node property, and as a plugin developer you cannot change that. Setting it on the PwCalculation process class is no good, right?

So in conclusion, I think you raise a good point: caching, when used in production, needs to be applied carefully and should always err on the safe side. However, I still think it is hugely important in debugging mode, and only allowing successful calculations might be too restrictive. I am not sure how this could be turned on through configuration, either.

@greschd
Member Author

greschd commented Oct 1, 2019

Yeah, manually disabling it would work (as would just deleting the broken calc), but it's indeed not a great solution.

> Except the problem is that this is a Node property, and as a plugin developer you cannot change that. Setting it on the PwCalculation process class is no good, right?

Ah yes, that seems to be an unintended consequence of the node/calc split. So maybe the CalcJobNode implementation of is_valid_cache should defer to the specific calculation implementation? We could make it really explicit by having something like a valid_cache_exit_codes hook on the calculation class.
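
A rough sketch of the proposed hook (all names are hypothetical):

```python
from aiida.orm import CalcJobNode

class DeferringCalcJobNode(CalcJobNode):
    """Hypothetical: the node defers cache validity to its process class."""

    @property
    def is_valid_cache(self):
        if not super().is_valid_cache:
            return False
        # `valid_cache_exit_codes` is the hypothetical hook a CalcJob
        # plugin could define; without it, keep the current behaviour.
        allowed = getattr(self.process_class, 'valid_cache_exit_codes', None)
        return allowed is None or self.exit_status in allowed
```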

@greschd
Member Author

greschd commented Oct 1, 2019

More generally, I think there needs to be a way for plugin developers to disable caching for a given calculation class -- which was the purpose of is_valid_cache. One might imagine a plugin for a probabilistic code that should never cache.

EDIT: Yes, you can do that in the configuration, but the end user shouldn't have to care - just putting default: True should be a perfectly acceptable caching config.

@sphuber
Contributor

sphuber commented Oct 1, 2019

> More generally, I think there needs to be a way for plugin developers to disable caching for a given calculation class -- which was the purpose of is_valid_cache. One might imagine a plugin for a probabilistic code that should never cache.

This can be done, though, through the configuration file. We had the same problem there: after the process/node split, the original syntax of that file, which expected Node subclasses, no longer worked. But I fixed that, and you can now specify a process class, and those will not be cached.
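
For reference, the AiiDA 1.x caching configuration lives in a YAML file (.aiida/cache_config.yml); something like the following should disable caching for one process class while keeping it on by default (profile name is a placeholder; verify the exact syntax against your version):

```yaml
my-profile:            # name of the AiiDA profile this applies to
  default: True        # cache everything by default...
  disabled:            # ...except processes listed by entry point
    - aiida.calculations:quantumespresso.pw
```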

@greschd
Member Author

greschd commented Oct 1, 2019

Sorry, I should've made a separate comment instead of an edit: I think the end user shouldn't have to care about this -- default: True should be a perfectly valid caching config. But that's not quite as important, in part because we don't currently have any plugins that would need it.

The broader point is that I think is_valid_cache is something that the calculation implementation should have some control over.

@greschd
Member Author

greschd commented Oct 1, 2019

In workchain development, would restricting it to certain exit codes be enough? Or are there cases where you need to test more generic errors as well?

Another option would be to introduce different levels of caching: "normal" for production use and "aggressive" for workchain development, differing in how they handle these cases.

@zhubonan
Contributor

I am also interested in this -- allowing caching to be turned off (or invalidated) for specific exit_status values of some calculations.

One example use case would be handling calculations left unfinished because of the time limit. At the moment, max_wallclock_seconds is not included in the hash for a CalcJob. If I have a calculation that was killed by the scheduler because of the time limit and I re-run it with an increased max_wallclock_seconds, AiiDA will give me the killed run again.

On the other hand, it does not make sense to simply remove max_wallclock_seconds from the list of attributes ignored when computing the hash; otherwise identical calculations with different time limits would be considered different.

Perhaps the solution would be to add something to the process steps in CalcJob (on_finish)? I wonder if it is safe to override the result method at the plugin level.
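
Under the hook discussed above, this walltime case could be handled by declaring the scheduler-kill exit code non-cacheable, e.g. (names and code values hypothetical):

```python
ERROR_OUT_OF_WALLTIME = 400  # whatever exit status the plugin assigns

def is_valid_cache_for_walltime(node):
    """Hypothetical: never reuse a run the scheduler killed for exceeding
    max_wallclock_seconds, so rerunning with a larger limit re-executes."""
    if node.exit_status == ERROR_OUT_OF_WALLTIME:
        return False
    return node.is_finished
```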

@greschd
Member Author

greschd commented Nov 22, 2019

I think this could also be solved by giving the CalcJob class some control over is_valid_cache. It's an interesting point that we might want to split this by exit code -- perhaps by default is_valid_cache should consult a plugin-defined list of known "cacheable" (or non-cacheable) exit codes. It could also be configurable from the spec.exit_code call.
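
The fix that eventually closed this issue (#3637) went in this direction: aiida-core's spec.exit_code gained an invalidates_cache flag. A sketch of its use (check your aiida-core version for the exact signature):

```python
from aiida.engine import CalcJob

class SomeCalculation(CalcJob):
    """Sketch: marking an exit code as invalidating the cache."""

    @classmethod
    def define(cls, spec):
        super().define(spec)
        # A node that finished with this exit code will never be used
        # as a cache source.
        spec.exit_code(
            400, 'ERROR_OUT_OF_WALLTIME',
            message='The calculation ran out of walltime.',
            invalidates_cache=True,
        )
```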

In the particular case of the walltime, whether caching should be used depends on whether you can restart from the interrupted calculation. If you can, that restart should be done at the workflow level -- starting again from scratch would be a waste of resources.

In any case, this is probably something I will look into at the coding week.
