ci: Fix possible OOM error Process completed with exit code 137
#409
Is it due to a timeout? How long does it run before being killed? |
The two runs were killed after 7m 43s and 7m 20s, so I guess it's not due to a timeout... |
You are right, the CI timeout is 45 min, so it could be some random failure. Or is it always the same test configuration? |
It seems to happen particularly on Ubuntu...(?) |
It seems the following runs on Ubuntu were also unexpectedly killed, probably for the same reason:
I haven't checked many runs yet, but I've never seen Windows and macOS processes get killed... The above three runs were killed while testing. Error log:
tests/models/test_scripts.py::test_cli_run_log_regression[--max_epochs 1 --max_steps 2] PASSED [ 51%]
/home/runner/work/_temp/ad14f744-ecd1-4908-beaa-c3ef9af1bc7c.sh: line 2: 2831 Killed coverage run --source pl_bolts -m py.test pl_bolts tests -v --junitxml=junit/test-results-Linux-3.8-latest.xml
tests/models/test_scripts.py::test_cli_run_vision_image_gpt[--data_dir /home/runner/work/pytorch-lightning-bolts/pytorch-lightning-bolts/datasets --max_epochs 1 --max_steps 2]
Error: Process completed with exit code 137. |
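For context on the reasoning here: exit code 137 means the process was terminated by SIGKILL (128 + 9), which on Linux is typically sent by the kernel OOM killer rather than a test timeout. A minimal, non-project-specific check of that mapping:

```python
import signal

# Exit codes above 128 encode "killed by signal (code - 128)".
# 137 - 128 == 9 == SIGKILL, the signal the Linux OOM killer sends.
assert 137 - 128 == signal.SIGKILL == 9
print("exit code 137 == 128 + SIGKILL -> process was killed, likely by the OOM killer")
```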
Could OOM be the reason why the processes were killed? See Stack Overflow - Process finished with exit code 137 in PyCharm. I will check whether Ubuntu has less memory than macOS and Windows on GitHub Actions... |
As far as I can tell from the docs, all environments have the same hardware resources, but something might be occupying a lot of memory particularly on Ubuntu. |
Some of the runs were killed while testing the following:
Does this imply that we should lower |
I am getting more confident that out-of-memory is the cause of the kills, as I checked memory usage with this script in akihironitta#1. I observed the following output from https://github.com/akihironitta/pytorch-lightning-bolts/pull/1/checks?check_run_id=1468220929, which indicates that almost all memory was consumed while the tests were running (although the run was successful):
[in MiB] |
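The exact script from akihironitta#1 isn't reproduced here; the sketch below is a hypothetical equivalent that samples system memory with psutil in a background thread while the tests run:

```python
import threading
import time

import psutil  # third-party; `pip install psutil`


def log_memory(interval: float = 5.0) -> None:
    """Print used/total system memory (in MiB) every `interval` seconds."""
    while True:
        mem = psutil.virtual_memory()
        used_mib = (mem.total - mem.available) / 2**20
        total_mib = mem.total / 2**20
        print(f"[memory] {used_mib:.0f} / {total_mib:.0f} MiB used")
        time.sleep(interval)


# Start the logger as a daemon thread so it dies together with the test process.
threading.Thread(target=log_memory, daemon=True).start()
```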
When I increased batch_size in akihironitta#1, all the runs failed with the same error, which probably implies that the available memory isn't enough for Bolts' tests. So I think the options we have are:
|
@Borda Do you think we can increase the memory size or use a smaller network architecture for all models? It seems almost all the CI runs failed on the latest commit to |
I do not think we can do much about the memory; rather, let's try to use a smaller model. Is that possible? |
@Borda For models with replaceable backbones, we can define smaller backbones and replace relatively big ones like ResNet with small ones. I'm not sure if that's really possible, but at least I can try (hopefully this year...). |
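As a sketch of that idea (hypothetical, not the actual Bolts code): models that accept a backbone argument could be handed a tiny stand-in network in tests instead of a full ResNet.

```python
import torch
from torch import nn


class TinyBackbone(nn.Module):
    """A minimal stand-in for a ResNet-style backbone, for memory-light tests."""

    def __init__(self, out_features: int = 16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(8, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x).flatten(1)
        return self.fc(x)


# Hypothetical usage in a test: pass the tiny backbone instead of a ResNet.
# model = SomeBoltsModel(backbone=TinyBackbone())  # `SomeBoltsModel` is illustrative only
```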
I think @awaelchli solved it for PL with a shared memory argument; let's reopen if it appears again.. 🐰 |
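The exact shared-memory fix referenced above isn't shown in the thread. One common shared-memory-related mitigation on constrained CI runners (an assumption here, not necessarily what was done for PL) is to keep DataLoaders single-process so PyTorch doesn't pass tensors through /dev/shm:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset just to make the sketch self-contained.
dataset = TensorDataset(torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,)))

# num_workers=0 keeps data loading in the main process, so no inter-process
# shared memory is used -- a common workaround on memory-limited CI machines.
loader = DataLoader(dataset, batch_size=4, num_workers=0)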
Hi team, sorry to be snooping around. I googled this issue as I'm having the same problem working on the timm library: huggingface/pytorch-image-models#993. But I found that no individual test triggers OOM, so it feels like it's something to do with memory not being released properly between tests. Wondering if you had similar experiences. Many thanks |
Hi @alexander-soare, I was wondering if you had discovered anything regarding your comment above! I'm experiencing the same issue where no individual test triggers an OOM, and your theory about memory not being released properly seems like a viable explanation. |
@connor-mccorm unfortunately we never figured it out. We just came up with band-aid solutions: running the tests in individual chunks or with multiprocessing. |
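A rough sketch of the "individual chunks" workaround (paths and layout are hypothetical): run each test file in its own pytest subprocess so the interpreter's memory is fully released between chunks.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical test layout; adjust the glob to the real test directory.
for test_file in sorted(Path("tests").rglob("test_*.py")):
    print(f"=== running {test_file} ===")
    result = subprocess.run([sys.executable, "-m", "pytest", str(test_file)])
    if result.returncode not in (0, 5):  # pytest exit code 5 = no tests collected
        sys.exit(result.returncode)
```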
@alexander-soare good to know. Thanks for the information! |
🐛 Bug
It seems
CI full testing / pytest (ubuntu-20.04, *, *)
particularly tends to fail with the error:
Example CI runs
This error might also happen on a different OS or different versions; I haven't investigated yet.
To Reproduce
Not sure how to reproduce...
Additional context
Found while handling the dataset caching issue in #387 (comment).