
Build OpenBLAS with CROSS option to prevent tests at compile time #158

Merged

Conversation

marcinz
Collaborator

@marcinz marcinz commented Dec 16, 2021

This change prevents OpenBLAS from running its tests during compilation. This is needed when building on machines or under conditions (e.g., inside Docker) that cause some of the OpenBLAS tests to fail. The potential downside is that we would want OpenBLAS's checks to run by default when users build in the same environment in which they will run. If we want to run the OpenBLAS tests by default, we could instead add an option that disables testing at build time.

@magnatelee
Contributor

Just curious, do you know why the tests fail in those conditions and whether they are actually false negatives (so we can ignore them)?

@manopapad
Contributor

When building OpenBLAS in docker on some machines, the following test seems to fail:

TEST 34/35 kernel_regress:skx_avx [OK]
TEST 35/35 fork:safety_after_fork_in_parent [FAIL]
  ERR: Failed to fork process.
RESULTS: 35 tests (34 ok, 1 failed, 0 skipped) ran in 18 ms
make[1]: *** [Makefile:52: run_test] Error 1
make[1]: Leaving directory '/tmp/tmp33ypc5ru/utest'
make: *** [Makefile:147: tests] Error 2
Traceback (most recent call last):

This also happened to @sbak5 recently. I suspect it's due to process-spawning limitations placed on the Docker runtime, which cause fork to fail (i.e., nothing to do with OpenBLAS itself). We could confirm this hypothesis by doing a normal process fork in a test program inside this environment and seeing if it works.

@magnatelee
Contributor

@manopapad Thanks for sharing your finding. Yes, let's confirm that hypothesis (shouldn't be too hard to do that).

@manopapad
Contributor

We could confirm this hypothesis by trying to do a normal process fork in a test program inside this environment, and seeing if it works.

@marcinz, @sbak5 can one of you check this on the machine that was having issues building OpenBLAS in docker?

@sbak5
Contributor

sbak5 commented Dec 16, 2021

@manopapad I'm building it on a machine where I had the problem.

@sbak5
Contributor

sbak5 commented Dec 16, 2021

It built without any errors.

@manopapad
Contributor

I assume you mean that with this PR the full legate image build gets past the OpenBLAS phase and completes successfully. That is good to know.

Could you also help us test the hypothesis about what is causing the OpenBLAS test failure in the first place? By building a test program that checks whether it can successfully perform a fork, and running that during docker build?

@sbak5
Contributor

sbak5 commented Dec 16, 2021

Sure, I'll add a small test program to the Dockerfile to confirm whether our hypothesis is correct.

@sbak5
Contributor

sbak5 commented Dec 16, 2021

I tried running a test program that forks a process and prints errno if the fork fails, but it doesn't reproduce the error I had with the OpenBLAS utest. fork() can fail for various reasons (thread limits, memory limits, etc.), so it would be better to print the errno from the tests in OpenBLAS itself.

I'm building the image on different kinds of nodes on the internal dlcluster. On one node, the build failed without the message above; on another node, it failed with the message above. Both failed during the OpenBLAS testing step, but at different points.

Skipping the tests avoids the issues I hit on the different nodes.

@sbak5
Contributor

sbak5 commented Dec 17, 2021

More specifically, the error happens at OpenBLAS/utest/openblas_utest

@marcinz
Collaborator Author

marcinz commented Dec 17, 2021

I also tried to run a fork bomb in the container in which we build, and there was no problem forking many levels deep. So it is not a simple fork that fails; there must be some other condition in the whole build process that causes the issue. I don't know how much effort it would take to track this down, so it may not be worth it. We can clearly see that some Docker/machine combinations fail while others succeed. In the case where I encountered the problem, the destination machine was different from the build machine anyway (building on a local machine for Selene).

@marcinz
Collaborator Author

marcinz commented Dec 17, 2021

I tried to just run OpenBLAS make, and the error happens even outside of our build process:

./openblas_utest
TEST 1/35 max:smax_zero [OK]
TEST 2/35 max:dmax_positive [OK]
TEST 3/35 max:smax_negative [OK]
TEST 4/35 min:smin_zero [OK]
TEST 5/35 min:dmin_positive [OK]
TEST 6/35 min:smin_negative [OK]
TEST 7/35 amax:damax [OK]
TEST 8/35 amax:samax [OK]
TEST 9/35 ismax:negative_step_2 [OK]
TEST 10/35 ismax:positive_step_2 [OK]
TEST 11/35 ismin:negative_step_2 [OK]
TEST 12/35 ismin:positive_step_2 [OK]
TEST 13/35 drotmg:drotmg_D1_big_D2_big_flag_zero [OK]
TEST 14/35 drotmg:rotmg_D1eqD2_X1eqX2 [OK]
TEST 15/35 drotmg:rotmg_issue1452 [OK]
TEST 16/35 drotmg:rotmg [OK]
TEST 17/35 axpy:caxpy_inc_0 [OK]
TEST 18/35 axpy:saxpy_inc_0 [OK]
TEST 19/35 axpy:zaxpy_inc_0 [OK]
TEST 20/35 axpy:daxpy_inc_0 [OK]
TEST 21/35 zdotu:zdotu_offset_1 [OK]
TEST 22/35 zdotu:zdotu_n_1 [OK]
TEST 23/35 dsdot:dsdot_n_1 [OK]
TEST 24/35 swap:cswap_inc_0 [OK]
TEST 25/35 swap:sswap_inc_0 [OK]
TEST 26/35 swap:zswap_inc_0 [OK]
TEST 27/35 swap:dswap_inc_0 [OK]
TEST 28/35 rot:csrot_inc_0 [OK]
TEST 29/35 rot:srot_inc_0 [OK]
TEST 30/35 rot:zdrot_inc_0 [OK]
TEST 31/35 rot:drot_inc_0 [OK]
TEST 32/35 potrf:smoketest_trivial [OK]
TEST 33/35 potrf:bug_695 [OK]
TEST 34/35 kernel_regress:skx_avx [OK]
TEST 35/35 fork:safety_after_fork_in_parent [FAIL]
  ERR: Failed to fork process.
RESULTS: 35 tests (34 ok, 1 failed, 0 skipped) ran in 20 ms
make[1]: *** [Makefile:52: run_test] Error 1
make[1]: Leaving directory '/home/legate-user/OpenBLAS/utest'
make: *** [Makefile:147: tests] Error 2

@manopapad
Contributor

@marcinz Can you add some code here to check what the value of errno is when the fork call fails https://github.com/xianyi/OpenBLAS/blob/develop/utest/test_fork.c#L111? Hopefully this will tell us what's wrong https://man7.org/linux/man-pages/man2/fork.2.html#ERRORS.

@marcinz
Collaborator Author

marcinz commented Dec 17, 2021

So this is the line that fails:

https://github.com/xianyi/OpenBLAS/blob/v0.3.15/utest/test_post_fork.c#L112

The failure is ENOMEM ("Cannot allocate memory").

@marcinz
Collaborator Author

marcinz commented Dec 17, 2021

From what I've been reading, this may be related to fork's pessimistic view of memory allocation: fork may return ENOMEM if the system could run out of memory under the current overcommit policy (setting the overcommit policy to a different value may prevent the problem). I've noticed that on the machine where the build works, swap is enabled:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:         515880       22715      491543           2        1620      489338
Swap:          8191          21        8170

while on the machine where swap is disabled, the build fails:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:         515900        5613      481706          42       28579      507834
Swap:             0           0           0

@sbak5 Could you check your machine whether swap is enabled or disabled and also check if your memory overcommit is set to 0 with sysctl vm.overcommit_memory?

@marcinz
Collaborator Author

marcinz commented Dec 17, 2021

So I enabled swap on the second machine from my previous comment, and the build worked fine. Note that there also had to be enough swap; 1 GB was too little. The next value I tried, which made the build work, was 100 GB, but I am certain that the minimum amount of swap necessary is closer to 1 GB (see the first machine in the previous comment) than to 100 GB.

The conclusion here, in my opinion, is that the authors of OpenBLAS have never run this test under unfavorable conditions such as the combination of memory overcommit policy and lack of swap found on the machines that are giving us trouble. I think this should not be a problem most of the time, since the problems with fork can be worked around. I am not sure what Python/NumPy does and whether it ever hits such a situation, but the important thing to keep in mind is that just because this test fails does not mean that real-world usage of OpenBLAS will fail in the same way.

We have a few options:

  1. Ignore the problem and always use the CROSS option.
  2. Make this a known issue and ask users to turn on CROSS option when this issue occurs or to create enough swap space to make it go away.
  3. Actually file a bug report upstream hoping that this will be fixed at some point and use CROSS until then.
  4. Other options...
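For context, all of the options above rely on OpenBLAS's CROSS flag, which marks the build as a cross-compile so the test binaries are compiled but never executed. A build invocation might look roughly like this (the install prefix is illustrative):

```shell
# Build OpenBLAS without running its test suite: CROSS=1 marks the
# build as a cross-compile, which skips the run-time tests.
make CROSS=1
make CROSS=1 PREFIX=/opt/openblas install  # install prefix is illustrative
```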

@sbak5
Contributor

sbak5 commented Dec 17, 2021

The machine I used has the following config.
root@30ef98a421cf:/workspace# free -m
              total        used        free      shared  buff/cache   available
Mem:          31793        6327         837           2       24627       25189
Swap:          2047          29        2018
root@30ef98a421cf:/workspace# sysctl vm.overcommit_memory
vm.overcommit_memory = 0
root@30ef98a421cf:/workspace# sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50

@manopapad
Contributor

Let's go with option (3).

@marcinz, could you please open an issue about this on the OpenBLAS bug tracker? I would especially note that this particular test fails even when running bare-metal on your local machine, but doesn't stop OpenBLAS from running correctly there.

And let's merge this change, so we skip the OpenBLAS tests when building, and don't have to deal with such issues.

@manopapad manopapad merged commit 4865da4 into nv-legate:branch-22.01 Dec 22, 2021
@manopapad manopapad mentioned this pull request Dec 22, 2021
fduguet-nv pushed a commit to fduguet-nv/cunumeric that referenced this pull request Mar 29, 2022
Cleanup / Initial pass for type annotations