Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shell: do not fail on oom init failure #6004

Merged
merged 3 commits into from
May 24, 2024

Conversation

chu11
Copy link
Member

@chu11 chu11 commented May 23, 2024

Problem: The oom detector in the job shell is not critical for job execution. However, if it fails to be initialized, the job shell
will exit with a shll.init failure and the job will fail.

Solution: Output a warning that oom detection is disabled when initialization fails. Do not fail to execute the job-shell. The user should now see something like the following for the inotify error.

2.220s: flux-shell[69]: WARN: oom: error setting up inotify: Too many open files
2.220s: flux-shell[69]: WARN: oom: oom detection disabled


I was in the general area of this code b/c of #6002, so decided to try and deal with this issue.

Question, should i log ERROR instead of WARN for the actual inotify error? B/c that is an actual error .. the warning is really that the detection is disabled.

@grondo
Copy link
Contributor

grondo commented May 23, 2024

A couple quick thoughts.

  • It might be nice to have a more specific error for EMFILE, e.g. max number of user inotify instances has been reached or similar.
  • I'm not sure if this is practical, but having one error message/warning might be better than two, e.g. max inotify instances reached: disabling oom detection and that's it.

@chu11
Copy link
Member Author

chu11 commented May 23, 2024

I'm not sure if this is practical, but having one error message/warning might be better than two, e.g. max inotify instances reached: disabling oom detection and that's it.

Ahhh, lets pass a flux_error_t into the init function. That'll do the trick.

@chu11 chu11 force-pushed the issue5920_shell_oom branch from 36d6788 to c853a4b Compare May 23, 2024 17:23
@chu11
Copy link
Member Author

chu11 commented May 23, 2024

re-pushed, tweaking per some comments above

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one very minor nit inline.

ERRNO_SAFE_WRAP (free, path);
return -1;
shell_warn ("disabling oom detection: %s", error.text);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If (very unlikely) flux_plugin_get_shell() fails, then error.text will be used uninitialized.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhh and I guess I didn't set an error message for the potential calloc() failure. I moved the !shell check into oom_create() and set an appropriate error message there.

Also noticed some tabbing was wrong, so added a commit for that as well.

@chu11 chu11 force-pushed the issue5920_shell_oom branch 2 times, most recently from 7bd8549 to d1fda0e Compare May 24, 2024 16:46
chu11 added 3 commits May 24, 2024 17:07
Problem: Some function parameters in the oom_create() function were
not tabbed over correctly.

Fix the tabbing.
Problem: When inotify fails for the oom plugin, the job-shell outputs
a fairly useless error message of:

1.202s: flux-shell[98]: ERROR: plugin 'oom': shell.init failed
1.202s: flux-shell[98]: FATAL: shell_init

Solution: Add extra messages so the user has more of a
clue why the job did not run.  For the inotify error, the user
should now see someting like the following.

7.111s: flux-shell[11]:  WARN: oom: max number of user inotify instances has been reached
Problem: The oom detector in the job shell is not critical for job
execution.  However, if it fails to be initialized, the job shell
will exit with a shll.init failure and the job will fail.

Solution: Output a warning that oom detection is disabled when initialization
fails.  Do not fail to execute the job-shell.  The user should now see
something like the following for an inotify error.

2.636s: flux-shell[113]:  WARN: oom: disabling oom detection: max number of user inotify instances has been reached

Fixes flux-framework#5920
@chu11 chu11 force-pushed the issue5920_shell_oom branch from d1fda0e to 79b4d5c Compare May 24, 2024 17:07
@mergify mergify bot merged commit fe2d327 into flux-framework:master May 24, 2024
33 checks passed
@chu11 chu11 deleted the issue5920_shell_oom branch May 29, 2024 17:56
Copy link

codecov bot commented Aug 22, 2024

Codecov Report

Attention: Patch coverage is 6.66667% with 14 lines in your changes missing coverage. Please review.

Project coverage is 83.28%. Comparing base (8fa31c9) to head (79b4d5c).
Report is 469 commits behind head on master.

Files Patch % Lines
src/shell/oom.c 6.66% 14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6004      +/-   ##
==========================================
- Coverage   83.28%   83.28%   -0.01%     
==========================================
  Files         518      518              
  Lines       83427    83439      +12     
==========================================
+ Hits        69486    69492       +6     
- Misses      13941    13947       +6     
Files Coverage Δ
src/shell/oom.c 21.66% <6.66%> (-1.49%) ⬇️

... and 13 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants