
Limit caching to successful calculations? #3368

Closed
greschd opened this issue Oct 1, 2019 · 12 comments · Fixed by #3637

Comments

@greschd
Member

greschd commented Oct 1, 2019

In the previous AiiDA version, caching checked that a calculation was in the FINISHED state before using it as a cache source -- the purpose being to allow failed calculations to be re-run.

With the updates to 1.0, the concept of calculation states was overhauled, and now a calculation can be finished even if it failed (with non-zero exit codes).

I believe the caching implementation should be adapted so that it's only triggered when the exit status is zero.
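
For concreteness, the proposed rule amounts to something like the following sketch (a hypothetical standalone check, not the actual aiida-core implementation):

```python
def node_is_valid_cache_source(node):
    """Hypothetical check: only reuse a calculation that finished
    with exit status zero, i.e. a genuine success."""
    # `is_finished` is True for FINISHED nodes regardless of exit status;
    # requiring `exit_status == 0` excludes the soft-failure cases.
    # (aiida-core also exposes this combination as `is_finished_ok`.)
    return node.is_finished and node.exit_status == 0
```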

@sphuber @giovannipizzi do you agree with this?

@sphuber
Contributor

sphuber commented Oct 1, 2019

We actually considered this problem and decided to also consider failed calculations cacheable. The use case comes from developing workchains: caching is really useful for not having to repeat the calculations themselves when rerunning a workchain. However, often you want to test the error handling, which is only triggered by "failed" calculations; excluding them from caching forces them to be rerun. The difference with the old system is that the old FAILED state covered all problems: not just soft failures flagged by the parser (convergence not reached) but also plain exceptions. The new "failed" concept of FINISHED + non-zero exit status is a lot softer and doesn't necessarily indicate a real failure.

@greschd
Member Author

greschd commented Oct 1, 2019

Yeah, but it might. The reason I've encountered this: I fixed some intermittent issue on the computer, after which the calculation still failed because the result was taken from the cache. I don't know how to make this calculation usable again with the current setup.

In my opinion, being cautious with caching is the preferable approach -- i.e., only cache when you're certain the calculation consistently behaves that way.

@greschd
Member Author

greschd commented Oct 1, 2019

As a plugin developer, you can always override the is_valid_cache property to, for example, allow certain "known good" error codes.
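
A sketch of what such an override could look like, assuming one could register a custom node class (which, as the following comments point out, plugin developers actually cannot do; all names and code values here are illustrative):

```python
from aiida.orm import CalcJobNode

class MyCodeCalcJobNode(CalcJobNode):
    """Hypothetical node subclass illustrating the suggestion above."""

    # Exit codes considered deterministic and therefore safe to reuse
    # from the cache (the values here are made up for illustration).
    _CACHEABLE_EXIT_CODES = {0, 410}

    @property
    def is_valid_cache(self):
        # Keep all base checks (hash present, caching enabled, ...) and
        # additionally require a "known good" exit code.
        return super().is_valid_cache and self.exit_status in self._CACHEABLE_EXIT_CODES
```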

@greschd
Member Author

greschd commented Oct 1, 2019

In https://github.com/aiidateam/aiida-quantumespresso/blob/develop/aiida_quantumespresso/calculations/pw.py, for example, the 3xx error codes are the kind that should probably not be cached, while 4xx and 5xx could be.
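
In other words, a range-based convention along these lines (a hypothetical helper, not part of aiida-quantumespresso):

```python
def pw_exit_status_is_cacheable(exit_status):
    """Hypothetical classification of PwCalculation exit codes:
    3xx codes signal problems with the run environment or output
    retrieval and should be retried, while 4xx/5xx codes describe
    deterministic outcomes of the calculation itself."""
    if exit_status == 0:
        return True
    return 400 <= exit_status < 600  # 4xx and 5xx: cacheable; 3xx: not
```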

@sphuber
Contributor

sphuber commented Oct 1, 2019

> Yeah, but it might. The reason I've encountered this: I fixed some intermittent issue on the computer, after which the calculation still failed because the result was taken from the cache. I don't know how to make this calculation usable again with the current setup.

You can remove the hash extra from the calculation that you don't want to be used for caching. Not an ideal mechanism, but it is currently the only way to "disable" individual nodes from being considered. Or is that not what you were asking?
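
Concretely, the manual workaround looks roughly like this (the extra key '_aiida_hash' is what aiida-core 1.x uses for the node hash, but verify against your version):

```python
from aiida.orm import load_node

node = load_node(1234)  # pk of the calculation you no longer want matched
# Deleting the hash extra means cache lookups will never match this node.
node.delete_extra('_aiida_hash')
```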

> In my opinion, being cautious with caching is the preferable approach -- i.e., only cache when you're certain the calculation consistently behaves that way.

I see, fair enough, but I still think this is going to be a big limiting factor for the workchain-development use case. Without caching, debugging workchains is not really feasible.

> As a plugin developer, you can always override the is_valid_cache property to, for example, allow certain "known good" error codes.

Except the problem is that this is a Node property, and as a plugin developer you cannot change that. Setting it on the PwCalculation process class is no good, right?

So in conclusion, I think you raise a good point: caching, when used in production, needs to be applied carefully and should always err on the safe side. However, I still think it is hugely important in debugging mode, and only allowing successful calculations might be too restrictive. I am not sure how this could be turned on through configuration, either.

@greschd
Member Author

greschd commented Oct 1, 2019

Yeah, manually disabling it would work (as would just deleting the broken calc), but it's indeed not a great solution.

> Except the problem is that this is a Node property, and as a plugin developer you cannot change that. Setting it on the PwCalculation process class is no good, right?

Ah yes, that seems to be an unintended consequence of the node/calc split. So maybe the CalcJobNode implementation of is_valid_cache should defer to the specific calculation implementation? We could make it really explicit by having something like a valid_cache_exit_codes hook on the calculation class.
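
A rough sketch of the proposed hook (all names are hypothetical):

```python
from aiida.orm import CalcJobNode

class DeferringCalcJobNode(CalcJobNode):
    """Hypothetical: the node defers cache validity to its process class."""

    @property
    def is_valid_cache(self):
        if not super().is_valid_cache:
            return False
        # `valid_cache_exit_codes` is the hypothetical hook a CalcJob
        # plugin could define; without it, keep the current behaviour.
        allowed = getattr(self.process_class, 'valid_cache_exit_codes', None)
        return allowed is None or self.exit_status in allowed
```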

@greschd
Member Author

greschd commented Oct 1, 2019

More generally, I think there needs to be a way for plugin developers to disable caching for a given calculation class -- which was the purpose of is_valid_cache. One might imagine a plugin for a probabilistic code that should never cache.

EDIT: Yes, you can do that in the configuration, but the end user shouldn't have to care - just putting default: True should be a perfectly acceptable caching config.

@sphuber
Contributor

sphuber commented Oct 1, 2019

> More generally, I think there needs to be a way for plugin developers to disable caching for a given calculation class -- which was the purpose of is_valid_cache. One might imagine a plugin for a probabilistic code that should never cache.

This can be done, though, through the configuration file. We had the same problem there: after the process/node split, the original syntax of that file, which expected Node subclasses, no longer worked. But I fixed that, and you can now specify a process class, and those will not be cached.
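
For reference, the AiiDA 1.x caching configuration lives in a YAML file (.aiida/cache_config.yml); something like the following should disable caching for one process class while keeping it on by default (profile name is a placeholder; verify the exact syntax against your version):

```yaml
my-profile:            # name of the AiiDA profile this applies to
  default: True        # cache everything by default...
  disabled:            # ...except processes listed by entry point
    - aiida.calculations:quantumespresso.pw
```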

@greschd
Member Author

greschd commented Oct 1, 2019

Sorry, I should've made a separate comment instead of an edit: I think the end user shouldn't have to care about this -- default: True should be a perfectly valid caching config. But that's not quite as important, in part because we don't currently have any plugins that would need it.

The broader point is that I think is_valid_cache is something that the calculation implementation should have some control over.

@greschd
Member Author

greschd commented Oct 1, 2019

In workchain development, would restricting it to certain exit codes be enough? Or are there cases where you need to test more generic errors as well?

Another option would be to introduce different levels of caching: "normal" for production use and "aggressive" for workchain development, differing in how they handle these cases.

@zhubonan
Contributor

I am also interested in this -- allowing caching to be turned off (or invalidated) for specific exit_status values of some calculations.

One example use case would be handling calculations left unfinished because of the time limit. At the moment, max_wallclock_seconds is not included in the hash for a CalcJob. If I have a calculation that was killed by the scheduler because of the time limit and I re-run it with an increased max_wallclock_seconds, AiiDA will give me the killed run again.

On the other hand, it does not make sense to simply remove max_wallclock_seconds from the list of attributes ignored when computing the hash; otherwise identical calculations with different time limits would be considered different.

Perhaps the solution would be to add something to the process steps in CalcJob (on_finish)? I wonder if it is safe to override the result method at the plugin level.
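
Under the hook discussed above, this walltime case could be handled by declaring the scheduler-kill exit code non-cacheable, e.g. (names and code values hypothetical):

```python
ERROR_OUT_OF_WALLTIME = 400  # whatever exit status the plugin assigns

def is_valid_cache_for_walltime(node):
    """Hypothetical: never reuse a run the scheduler killed for exceeding
    max_wallclock_seconds, so rerunning with a larger limit re-executes."""
    if node.exit_status == ERROR_OUT_OF_WALLTIME:
        return False
    return node.is_finished
```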

@greschd
Member Author

greschd commented Nov 22, 2019

I think this could also be solved by giving the CalcJob class some control over is_valid_cache. It's an interesting point that we might want to split this by exit code -- perhaps by default is_valid_cache should consult a plugin-defined list of known "cacheable" (or non-cacheable) exit codes. It could also be configurable from the spec.exit_code call.
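
The fix that eventually closed this issue (#3637) went in this direction: aiida-core's spec.exit_code gained an invalidates_cache flag. A sketch of its use (check your aiida-core version for the exact signature):

```python
from aiida.engine import CalcJob

class SomeCalculation(CalcJob):
    """Sketch: marking an exit code as invalidating the cache."""

    @classmethod
    def define(cls, spec):
        super().define(spec)
        # A node that finished with this exit code will never be used
        # as a cache source.
        spec.exit_code(
            400, 'ERROR_OUT_OF_WALLTIME',
            message='The calculation ran out of walltime.',
            invalidates_cache=True,
        )
```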

In the particular case of the walltime, whether caching should be used depends on whether you can restart from the interrupted calculation. If you can, that restart should be done at the workflow level -- starting again from scratch would be a waste of resources.

In any case, this is probably something I will look into at the coding week.
