Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch subprocess.TimeoutError once #196

Merged
merged 3 commits into from
Jul 4, 2024
Merged

Conversation

Icemole
Copy link
Collaborator

@Icemole Icemole commented Jul 3, 2024

This was the root cause of several jobs with the same features being submitted, even though they had already been submitted:

  1. We use the gateway option and there's a TimeoutError here.
  2. From the point of view of sisyphus, the ssh gw-02 squeue finishes with error -1. Sisyphus then reaches here safely. It hasn't crashed with a TimeoutError because we haven't propagated it.
  3. No TimeoutError is raised after calling self.system_call() because it's already been addressed inside self.system_call() and it hasn't been propagated. Therefore, the process ignores the except block here. If retval == -1, sisyphus doesn't care either!

In my view, the problem comes from not addressing errors that could have been obtained from the return values. Therefore, this PR correctly addresses return codes different from zero, and leaves the work of addressing the exceptions of the subprocesses to self.system_call().

This was the root cause of several jobs with the same features being submitted, even though they had already been submitted
@Icemole Icemole requested review from albertz and michelwi July 3, 2024 11:08
@albertz
Copy link
Member

albertz commented Jul 3, 2024

For reference, related is the recent change in PR #191 on how to handle TimeoutExpired .

@albertz
Copy link
Member

albertz commented Jul 3, 2024

We use the gateway option and there's a TimeoutError here.

I think you mean TimeoutExpired?

@albertz
Copy link
Member

albertz commented Jul 3, 2024

I already asked in #191 whether catching TimeoutExpired is really correct, or whether we just should pass on this exception, exactly as you do now in this PR. In #191, there was no answer to that. See #191 (comment) and #191 (comment).

@albertz
Copy link
Member

albertz commented Jul 3, 2024

As I wrote before, I wonder, why do we actually do this change? Why not keep the original behavior, i.e. just not handle TimeoutExpired in system_call?

@Icemole
Copy link
Collaborator Author

Icemole commented Jul 3, 2024

Thanks for the comments Albert. I didn't see @michelwi's and your comments. Indeed, this was the source of an issue.

I see that the return value is correctly handled below. I will therefore leave the code as it was, catching the TimeoutExpired outside system_call.

@albertz
Copy link
Member

albertz commented Jul 3, 2024

I will therefore leave the code as it was, catching the TimeoutExpired outside system_call.

I.e. you update this PR here, to remove the try:/except TimeoutExpired: in system_call?

@Icemole
Copy link
Collaborator Author

Icemole commented Jul 3, 2024

I will therefore leave the code as it was, catching the TimeoutExpired outside system_call.

I.e. you update this PR here, to remove the try:/except TimeoutExpired: in system_call?

Yes! I did that just now :)

@Icemole Icemole merged commit ce5f7a2 into master Jul 4, 2024
3 checks passed
@Icemole Icemole deleted the too-many-open-files-fix-2 branch July 4, 2024 06:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants