Improve out of memory error given to users #725
Sadly, it's not easy to do this, at least it wasn't when we first implemented it, which is why we have this ambiguity.

Firstly, we don't know whether we were OOM-killed because we hit the per-job limit (default of 4GB in opensafely-cli), or because we actually exhausted the entire memory available to the system. Both can and do occur, both locally and in production. For example, locally in a 4GB Codespace with the default concurrency of 2, if the two running jobs together consume more than the 4GB available (not unlikely), then it is system-level exhaustion, not job-level. In fact, since the default job limit is 4GB and the Codespace has 4GB, it will almost always be global system resources triggering the OOM kill.

Secondly, "where the limit is coming from" differs depending on the above. If it is the per-job limit, what to do depends on whether the job is running with opensafely-cli (where a user can change it) or in production (where they cannot). If it is the system limit, the user needs to decrease parallelisation or increase memory resources.

The text for this message comes from job-runner, which has to serve both use cases. It may be possible to detect that we're running in local_run mode and add additional text. We could also add text linking to https://docs.opensafely.org/opensafely-cli/#managing-resources, for example.

However, the situation may have changed slightly now that job-runner tracks per-job metrics. In theory, we could include the last recorded memory usage of the job in the text; see the sketch below for the kind of thing the message could say.
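As a rough sketch only (the function, parameter names, and exact wording here are hypothetical illustrations, not job-runner's actual code), the message could combine the per-job limit with the last recorded usage to hint at which limit was hit:

```python
# Hypothetical sketch: `oom_message`, `job_limit_gb` and `last_recorded_gb` are
# illustrative names, not job-runner's real API. The idea is to fold the last
# recorded per-job memory metric into the OOM error text.

DOCS_URL = "https://docs.opensafely.org/opensafely-cli/#managing-resources"


def oom_message(job_limit_gb, last_recorded_gb=None):
    """Build an OOM error message that hints at where the limit came from."""
    msg = f"Job ran out of memory (limit for this job was {job_limit_gb:.2f}GB)."
    if last_recorded_gb is not None:
        msg += f" Last recorded memory usage was {last_recorded_gb:.2f}GB."
        if last_recorded_gb < job_limit_gb * 0.9:
            # Usage was well below the per-job limit, so the kill was probably
            # caused by overall system memory pressure rather than the job limit.
            msg += (
                " This is well below the job limit, so the system as a whole"
                " probably ran out of memory: try reducing parallelism or"
                " increasing available memory."
            )
        else:
            msg += " The job appears to have hit its per-job memory limit."
    msg += f" See {DOCS_URL} for guidance on managing resources."
    return msg


if __name__ == "__main__":
    print(oom_message(4.0, 1.9))   # looks like system-level exhaustion
    print(oom_message(4.0, 3.98))  # looks like the per-job limit
```

The 90% threshold is purely illustrative; the useful part is surfacing both numbers plus the docs link.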
That would hopefully give enough information for the user to figure out where the limit is?
Yes, if it's not too difficult to do that would be ideal I think. Thank you. You're giving the user enough information that they can understand why the problem occurred.
Even something as simple as
I think it's better to link to the docs as I suggested, as --limit is often not the correct solution.
I was doing some Codespaces user testing this week and we came across this error:
Job exited with an error: Job ran out of memory (limit was 4.00GB)
The researcher in question said that this happens intermittently locally and that, when it does, they close other running applications and try again. Given that the researcher often had memory-related issues on their machine this was quite understandable, although it was not the right thing to do in this instance.
Would it be possible to expand the error message text to make it clearer where the limit is coming from and what to do when it's hit?