ci: Fix possible OOM error Process completed with exit code 137 #409

Closed
akihironitta opened this issue Nov 26, 2020 · 19 comments
Labels: bug (Something isn't working), ci/cd (Continuous Integration and delivery), help wanted (Extra attention is needed)

Comments

@akihironitta
Contributor

akihironitta commented Nov 26, 2020

🐛 Bug

It seems that the CI full testing / pytest (ubuntu-20.04, *, *) jobs in particular tend to fail with the following error:

/home/runner/work/_temp/5ef79e81-ccef-44a4-91a6-610886c324a6.sh: line 2:  1855 Killed                  coverage run --source pl_bolts -m pytest pl_bolts tests --exitfirst -v --junitxml=junit/test-results-Linux-3.7-latest.xml
Error: Process completed with exit code 137.

Example CI runs

This error might also happen on other operating systems or versions. I haven't investigated yet.

To Reproduce

Not sure how to reproduce...

Additional context

Found while handling the dataset caching issue in #387 (comment).

@akihironitta akihironitta added fix fixing issues... help wanted Extra attention is needed ci/cd Continues Integration and delivery labels Nov 26, 2020
@Borda
Member

Borda commented Nov 26, 2020

Is it due to a timeout? How long does it run before being killed?

@akihironitta
Contributor Author

The two runs were killed after 7m 43s and 7m 20s, so I guess it's not due to timeout...

@Borda
Member

Borda commented Nov 27, 2020

> The two runs were killed after 7m 43s and 7m 20s, so I guess it's not due to timeout...

You are right, the CI timeout is 45 min. So is it a random failure, or is it always the same test configuration?

@akihironitta
Contributor Author

CI full testing / pytest (ubuntu-20.04, 3.8, latest) also fails in:

It seems to happen particularly on Ubuntu...(?)

@akihironitta
Contributor Author

akihironitta commented Nov 28, 2020

It seems the following runs on Ubuntu were also unexpectedly killed, probably for the same reason:

I haven't checked many runs yet, but I've never seen Windows or macOS processes get killed...


The above three runs were killed while testing tests/models/test_scripts.py::test_cli_run_vision_image_gpt.

error log

tests/models/test_scripts.py::test_cli_run_log_regression[--max_epochs 1 --max_steps 2] PASSED [ 51%]
/home/runner/work/_temp/ad14f744-ecd1-4908-beaa-c3ef9af1bc7c.sh: line 2:  2831 Killed                  coverage run --source pl_bolts -m py.test pl_bolts tests -v --junitxml=junit/test-results-Linux-3.8-latest.xml
tests/models/test_scripts.py::test_cli_run_vision_image_gpt[--data_dir /home/runner/work/pytorch-lightning-bolts/pytorch-lightning-bolts/datasets --max_epochs 1 --max_steps 2] 
Error: Process completed with exit code 137.

@akihironitta
Contributor Author

Could OOM (out of memory) be the reason the processes were killed? See: Stack Overflow - Process finished with exit code 137 in PyCharm

I will see if Ubuntu doesn't have as much memory as macOS and Windows on GitHub Actions...
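
For context, shells report an exit status of 128 + N when a process is terminated by signal N, so 137 corresponds to signal 9 (SIGKILL), which is the signal the Linux OOM killer sends. A minimal sketch that decodes such a status is below; `describe_exit_code` is a hypothetical helper written for illustration, not part of Bolts or GitHub Actions. Note that SIGKILL can come from other sources too, so 137 alone does not prove an OOM kill.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit status (hypothetical helper for illustration)."""
    if code > 128:
        # Shell convention: status 128 + N means "terminated by signal N".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```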

@akihironitta
Contributor Author

akihironitta commented Nov 28, 2020

According to the docs, all environments have the same hardware resources, but something might be occupying a lot of memory on Ubuntu in particular.

2-core CPU
7 GB of RAM memory
14 GB of SSD disk space

GitHub Docs - Specifications for GitHub-hosted runners

@akihironitta
Contributor Author

Some of the runs were killed while testing the following:

  • tests/models/self_supervised/test_models.py::test_moco
  • tests/models/self_supervised/test_scripts.py::test_cli_run_self_supervised_moco[--data_dir /home/runner/work/pytorch-lightning-bolts/pytorch-lightning-bolts/datasets --max_epochs 1 --max_steps 3 --fast_dev_run --batch_size 2]
  • tests/models/test_scripts.py::test_cli_run_vision_image_gpt

Does this imply that we should lower batch_size and/or other parameters?

@akihironitta
Contributor Author

I am getting more confident that running out of memory is the cause of the kills, having checked memory usage with this script in akihironitta#1.

I observed the following output from https://github.com/akihironitta/pytorch-lightning-bolts/pull/1/checks?check_run_id=1468220929, which indicates that almost all memory was consumed while the tests were running (although the run was successful):

      date     time           total        used        free      shared  buff/cache   available
2020-11-28 21:05:53            6954         508        2774           3        3671        6145
2020-11-28 21:05:55            6954         517        2763           3        3673        6137
...
2020-11-28 21:13:48            6954        6768         106           1          80           3
2020-11-28 21:13:49            6954        6760         113           1          80          10
2020-11-28 21:13:51            6954        6778         104           1          70          53
2020-11-28 21:13:52            6954        6774         106           0          73           1
2020-11-28 21:13:53            6954        6771         104           0          77           1
...

[in MiB]
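
The script used in akihironitta#1 is not reproduced in this thread. As a rough sketch of how a similar timestamped log could be produced on a Linux runner, the code below samples `/proc/meminfo` at a fixed interval. All function names here are assumptions made for illustration, and "used" is simplified to `MemTotal - MemAvailable`, which is not the same formula the `free` command uses.

```python
import time
from datetime import datetime


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo content into a {field: MiB} mapping."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # Memory fields are reported in kB; convert to MiB.
            info[name.strip()] = int(parts[0]) // 1024
    return info


def format_row(info: dict, now: datetime) -> str:
    """Format one log line: date, time, total, used, free, available (MiB)."""
    total = info["MemTotal"]
    free = info["MemFree"]
    available = info["MemAvailable"]
    used = total - available  # simplification, see the note above
    return f"{now:%Y-%m-%d %H:%M:%S} {total:>8} {used:>8} {free:>8} {available:>8}"


def log_memory(interval_s: float = 1.0, samples: int = 5) -> None:
    """Print one row per sample; Linux only, since it reads /proc/meminfo."""
    for _ in range(samples):
        with open("/proc/meminfo") as f:
            print(format_row(parse_meminfo(f.read()), datetime.now()))
        time.sleep(interval_s)


# Example (run in the background while the tests execute):
# log_memory(interval_s=1.0, samples=600)
```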

@akihironitta
Contributor Author

akihironitta commented Dec 1, 2020

When I increased batch_size in akihironitta#1, all the runs failed with the same error:

Error: Process completed with exit code 137.

This probably implies that the available memory isn't enough for Bolts' tests.

So, I think the options we have are:

  1. increase the memory size of CI
  2. use smaller network architectures
  3. use a smaller batch_size (I tried reducing the batch size from 2 to 1, but since some models use batch normalization, the minimum required batch size is 2, as confirmed in [wip] ci: Reduce memory usage to avoid unexpected kills #411)
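
To illustrate why batch normalization puts a floor of 2 on the batch size during training, here is a small pure-Python sketch (not the Bolts or PyTorch implementation; `batch_norm` is a hypothetical helper). With a single sample, the batch mean equals the sample and the variance is zero, so every normalized value collapses to 0. As far as I know, PyTorch's batch-norm layers raise an error in training mode when they see only one value per channel, for this reason.

```python
def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the batch dimension using batch statistics."""
    n = len(batch)
    num_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(num_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(num_features)
    ]
    return [
        [(row[j] - means[j]) / (variances[j] + eps) ** 0.5 for j in range(num_features)]
        for row in batch
    ]


# One sample: mean == sample and variance == 0, so the output is all zeros.
print(batch_norm([[1.0, 5.0]]))

# Two samples: the normalized values are no longer degenerate.
print(batch_norm([[1.0, 5.0], [3.0, 1.0]]))
```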

@akihironitta
Contributor Author

akihironitta commented Dec 5, 2020

@Borda Do you think we can increase the memory size or use smaller network architectures for all models?


It seems almost all the CI runs failed on the latest commit to the master branch due to the same error.

@Borda
Member

Borda commented Dec 16, 2020

I do not think we can do much about the memory. Let's rather try to use a smaller model. Is that possible?

@akihironitta
Contributor Author

@Borda For models with replaceable backbones, we can define smaller backbones and replace relatively big ones like resnet with small ones. I'm not sure if that's really possible, but at least I can try (hopefully this year...).

@Borda Borda added this to the v0.3 milestone Jan 18, 2021
@Borda Borda modified the milestones: v0.3, v0.4 Jan 22, 2021
@akihironitta akihironitta changed the title ci: Fix Error: Process completed with exit code 137 ci: Fix possible OOM error Process completed with exit code 137 Mar 30, 2021
@Borda
Member

Borda commented Jul 6, 2021

I think that @awaelchli solved it for PL with a shared memory argument.
That is not the case here, though, as this issue was for GH Actions, not the Azure pipeline, where we used Docker images...

Let's reopen if it appears again.. 🐰

@alexander-soare

Hi team, sorry to be snooping around. I googled this issue as I'm having the same problem working on the timm library huggingface/pytorch-image-models#993. But I found that no individual tests trigger OOM, so it feels like it's something to do with memory not being released properly between tests. Wondering if you had similar experiences. Many thanks
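
One way to probe the "memory not released between tests" hypothesis is to record the process's resident memory after each test and look for steady growth. The sketch below is only an assumption about how that could be done on Linux (it reads `/proc/self/status`); the helper names are hypothetical and the `steps` are stand-ins for real test functions.

```python
import gc


def parse_vm_rss_kib(status_text: str) -> int:
    """Extract the VmRSS value (in kB) from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise ValueError("VmRSS field not found")


def current_rss_kib() -> int:
    """Resident set size of the current process; Linux only."""
    with open("/proc/self/status") as f:
        return parse_vm_rss_kib(f.read())


def rss_after_each(steps, read_rss=current_rss_kib):
    """Run (name, callable) steps in order and record memory after each one."""
    readings = []
    for name, fn in steps:
        fn()
        gc.collect()  # drop unreachable Python objects before measuring
        readings.append((name, read_rss()))
    return readings
```

If the readings keep climbing even though each step is small on its own, that would be consistent with memory accumulating across tests rather than one test being too large.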

@connor-mccorm

Hi @alexander-soare, I was wondering if you had discovered anything regarding your comment above! I'm experiencing the same issue where no individual test triggers an OOM, and your theory about memory not being released properly seems like a viable explanation.

@alexander-soare

@connor-mccorm unfortunately we never figured it out. We just came up with band-aid solutions: running the tests in individual chunks or with multiprocessing.
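
A rough sketch of the "individual chunks" workaround, assuming pytest is installed and the test files are listed explicitly: each chunk runs in a fresh Python process, so whatever memory a chunk holds is returned to the OS when that process exits. The function names and the chunk size are illustrative choices, not what timm or Bolts actually use.

```python
import subprocess
import sys


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_in_chunks(test_files, chunk_size=5):
    """Run pytest once per chunk, each in its own process; return the exit codes."""
    return_codes = []
    for chunk in chunked(test_files, chunk_size):
        # A new interpreter per chunk means memory cannot accumulate across chunks.
        proc = subprocess.run([sys.executable, "-m", "pytest", *chunk])
        return_codes.append(proc.returncode)
    return return_codes


# Example with hypothetical file names:
# codes = run_in_chunks(["tests/test_a.py", "tests/test_b.py"], chunk_size=1)
# sys.exit(0 if all(code == 0 for code in codes) else 1)
```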

@connor-mccorm

@alexander-soare good to know. Thanks for the information!

@ducanhle31

image

I'm having the same problem. How can I fix it?


5 participants