
Improve out of memory error given to users #725

Open
lucyb opened this issue Apr 25, 2024 · 4 comments

Comments

@lucyb
Contributor

lucyb commented Apr 25, 2024

I was doing some Codespaces user testing this week and we came across this error:

Job exited with an error: Job ran out of memory (limit was 4.00GB)

The researcher in question said that this happens intermittently locally and that, when it does, they close other running applications and try again. Given that the researcher often had memory-related issues on their machine, this was quite understandable, although it was not the right thing to do in this instance.

Would it be possible to expand the error message text to make it clearer where the limit is coming from and what to do when it's hit?

@bloodearnest
Member

Sadly, it's not easy to do this, at least it wasn't when we first implemented it, which is why we have this ambiguity.

Firstly, we don't know whether we were OOM-killed because we hit the per-job limit (default of 4GB in opensafely-cli), or because we actually exhausted the entire memory available to the system. Both can and do occur, both locally and in production. For example, locally, in a 4GB codespace with the default concurrency of 2, if both running jobs together consume more than the entire 4GB available (not unlikely), then it is system-level exhaustion, not job-level. In fact, since the default job limit is 4GB and the codespace is 4GB, it will almost always be global system resources triggering the OOM kill.
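To make the distinction concrete, here is a minimal sketch of a heuristic classifier over recorded metrics. The function name and thresholds are entirely invented for illustration; nothing like this currently exists in job-runner:

```python
def classify_oom(used_mb, job_limit_mb, system_free_mb):
    """Guess whether an OOM kill hit the per-job limit or system memory.

    Heuristic only: if the job's last recorded usage was close to its own
    limit, the per-job limit is the likely trigger; if system free memory
    was nearly gone, global exhaustion is more plausible.
    """
    if used_mb >= 0.9 * job_limit_mb:
        return "per-job limit"
    if system_free_mb < 0.05 * job_limit_mb:
        return "system exhaustion"
    return "unknown"
```

In the 4GB-codespace scenario above, a job killed while using well under its 4GB limit, with system free memory near zero, would be classified as system exhaustion.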

Secondly, "where the limit is coming from" differs depending on the above. If it's the per-job limit, what to do depends on whether the job is running with opensafely-cli (where a user can change the limit) or in production (where they cannot). If it's the system limit, they need to decrease parallelisation or increase memory resources.

The text for this message comes from job-runner, which has to serve both use cases. It may be possible to detect that we're running in local_run mode and add additional text, perhaps linking to https://docs.opensafely.org/opensafely-cli/#managing-resources, for example.
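As a sketch of that idea, assuming job-runner can tell it is in local_run mode (the helper name and wording below are hypothetical, not existing job-runner code):

```python
DOCS_URL = "https://docs.opensafely.org/opensafely-cli/#managing-resources"

def oom_extra_text(local_run):
    # In local_run mode the user controls the limit, so point them at the
    # resource-management docs; in production there is nothing they can tweak.
    if local_run:
        return f"See {DOCS_URL} for how to manage memory limits locally."
    return "Memory limits cannot be changed in production."
```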

However, the situation may have changed slightly with the introduction of per-job metrics tracking in job-runner. In theory, we could include the last recorded memory usage of the job in the text, e.g. something like:

Job exited with an error: Job ran out of memory. It was using N Mb, the per-job limit was Y Mb, and the system free memory was Z Mb.

That would hopefully give enough information for the user to figure out where the limit is?
The job-runner metrics system was not designed to be used in this way, so it's a little awkward, but it should be possible.
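Formatting such a message from the recorded metrics would be straightforward; a minimal sketch (the function name and parameters are assumptions, not job-runner's actual API):

```python
def oom_message(used_mb, limit_mb, system_free_mb):
    # Build the richer error text proposed above, filling in the last
    # recorded job usage, the per-job limit, and system free memory.
    return (
        "Job exited with an error: Job ran out of memory. "
        f"It was using {used_mb:.0f}Mb, the per-job limit was {limit_mb:.0f}Mb, "
        f"and the system free memory was {system_free_mb:.0f}Mb."
    )
```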

@lucyb
Contributor Author

lucyb commented Apr 25, 2024

> That would hopefully give enough information for the user to figure out where the limit is?

Yes, if it's not too difficult to do, that would be ideal, I think. Thank you. You'd be giving the user enough information to understand why the problem occurred.

@sebbacon
Contributor

Even something as simple as `Job exited with an error: Job ran out of memory. Either increase the limit with --limit, or write your code to use less memory` would be a strict improvement, IMO.

@bloodearnest
Member

> Even something as simple as `Job exited with an error: Job ran out of memory. Either increase the limit with --limit, or write your code to use less memory` would be a strict improvement, IMO.

I think it's better to link to the docs as I suggested, as --limit is often not the correct solution.
