gcc8+ memory usage regression for compiling indexing_op.o #18501

Open · wkcn opened this issue Jun 6, 2020 · 31 comments

wkcn (Member) commented Jun 6, 2020

Description

Hi there, I tried to build MXNet 2.0 (CPU only) on my laptop with 16 GB of memory. I found that it takes over 16 GB of memory to compile the single file src/operator/tensor/indexing_op.o. I had to create an extra 8 GB of virtual memory (swap) to build this file.

Is it possible to split indexing_op into multiple smaller files to reduce the memory cost?
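
For reference, the extra virtual memory mentioned above can be added as a temporary swap file on Linux; this is just a sketch, and the 8 GB size and the /swapfile path are illustrative, not part of any MXNet instructions.

# Create and enable an 8 GB swap file (run as root), then remove it after the build.
fallocate -l 8G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# ... run the build ...
swapoff /swapfile && rm /swapfile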

Environment

The latest code of MXNet 2.0
Arch Linux

Conclusion

The issue has been solved.

The memory cost depends on the compiler and the build method (ninja or make).
I built indexing_op.o with ninja using different versions of gcc:

Compiler      Memory cost (Child high-water RSS)
g++ 6.4.1     1.95 GB
g++ 7.4.1     1.78 GB
g++ 10.1.0    11 GB

Besides, the compiler flags differ between build methods (for example, the Makefile enables -funroll-loops, which takes more memory), so the memory cost differs as well.

The solution is to build MXNet with g++-6 or g++-7.
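
For reference, a minimal sketch of how this workaround looks for a CPU-only ninja build, assuming gcc-7/g++-7 are installed and the commands are run from an empty build directory (same invocation style as used later in this thread):

# Point CMake at gcc-7/g++-7 explicitly, then build with ninja.
CC=gcc-7 CXX=g++-7 cmake -GNinja -DUSE_CUDA=0 ..
ninja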

leezu (Contributor) commented Jun 7, 2020

Fixing this would be a welcome improvement. Did you investigate whether the high memory consumption is consistent between gcc and clang, and whether it is still present on gcc 9 (or 10) / clang 10?

wkcn (Member, Author) commented Jun 7, 2020

Hi @leezu, the compiler I used is the latest version of gcc, namely gcc 10.1.0.
indexing_op.o is the only file that takes a long time and over 16 GB of memory to build.
I have not tried clang yet.
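
For anyone who wants to check clang, the same CMake invocation used elsewhere in this thread can be pointed at it; this is just a sketch assuming clang/clang++ are installed and a clean build directory is used:

# CPU-only ninja build, but with clang as the compiler.
CC=clang CXX=clang++ cmake -GNinja -DUSE_CUDA=0 ..
ninja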

woreom commented Jun 8, 2020

I believe I could help, since we are clearly having the same problem. I don't know much about cross-compiling, but I have access to a computer with more than 16 GB of memory.

wkcn (Member, Author) commented Jun 8, 2020

I remember that it took less than 8 GB of memory to build older versions of MXNet.
In the latest version, most files still take less than 4 GB of memory, but a few files (e.g. indexing_op) take more than 16 GB (built with g++ 10.1.0).

If we can reduce the memory cost, it will help with building MXNet on laptops and edge machines, which often have less than 8 GB or 16 GB of memory.

woreom commented Jun 8, 2020

> I remember that it took less than 8 GB of memory to build older versions of MXNet.
> In the latest version, most files still take less than 4 GB of memory, but a few files (e.g. indexing_op) take more than 16 GB (built with g++ 10.1.0).
>
> If we can reduce the memory cost, it will help with building MXNet on laptops and edge machines, which often have less than 8 GB or 16 GB of memory.

Only 4 files take more than 8 GB.

leezu (Contributor) commented Jun 8, 2020

I think we should consider this a release-critical bug. @woreom @wkcn, did you check whether this affects the 1.7 / 1.x branches as well?

cc: @ciyongch

wkcn (Member, Author) commented Jun 9, 2020

@leezu Sorry, I did not check the 1.7 and 1.x branches.

leezu (Contributor) commented Jun 10, 2020

There seem to be some more issues. In certain build configurations with LLVM 7, many of the numpy object files blow up:

875M    build/CMakeFiles/mxnet.dir/src/operator/tensor/broadcast_reduce_norm_value.cc.o
918M    build/CMakeFiles/mxnet.dir/src/operator/numpy/np_elemwise_broadcast_logic_op.cc.o
1.2G    build/CMakeFiles/mxnet.dir/src/operator/numpy/np_where_op.cc.o
1.9G    build/CMakeFiles/mxnet.dir/src/operator/numpy/np_broadcast_reduce_op_value.cc.o
2.1G    build/CMakeFiles/mxnet.dir/src/operator/numpy/linalg/np_norm_forward.cc.o
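
(A sketch of how such a size listing can be produced; the path assumes the default CMake/ninja build directory used elsewhere in this thread.)

# List the five largest object files in the build tree.
find build/CMakeFiles/mxnet.dir -name '*.o' -exec du -h {} + | sort -h | tail -n 5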

ciyongch (Contributor) commented:

Hi @leezu, @wkcn, since this is only a build issue when building MXNet from source on certain machines (those with less than 16 GB of memory), I suggest not tagging it as a blocking issue for 1.7.0, and including the fix if it is available before the release happens.
Users can still install MXNet via the binary release/nightly image, or increase the virtual memory of their build machine as a workaround.
What do you think?

woreom commented Jun 11, 2020

It would be great if you could provide a prebuilt package that works on a Raspberry Pi with armv7, because I tried to build every version from 1.2.1 to 1.6.0 and failed.

wkcn (Member, Author) commented Jun 11, 2020

Hi @ciyongch, I agree that we don't need to tag it as a blocking issue, and the issue can be fixed after MXNet 1.7 is released.

Once the problem is addressed, we can backport the PR to the 1.7.x branch.

wkcn (Member, Author) commented Jun 11, 2020

Hi @woreom, could you please create an issue requesting prebuilt MXNet packages for ARM?

MXNet already has ARM build and test (#18264, #18058). I don't know whether a prebuilt package will be released.

woreom commented Jun 11, 2020

@wkcn I did in #18471, but @leezu closed it. I will open another one.

ciyongch (Contributor) commented:

> I agree that we don't need to tag it as a blocking issue, and the issue can be fixed after MXNet 1.7 is released.
>
> Once the problem is addressed, we can backport the PR to the 1.7.x branch.

Thanks for your confirmation, @wkcn :)

wkcn (Member, Author) commented Jun 11, 2020

@woreom It seems that pre-built MXNet 1.5 packages will not be uploaded because of the ASF licensing policy, but pre-built MXNet 1.7 and 2.0+ packages for ARM may be uploaded.

Until then, you can try the native build or cross-compiling, following the instructions: https://mxnet.apache.org/get_started?platform=devices&iot=raspberry-pi&

leezu (Contributor) commented Jun 11, 2020

I disagree. Official MXNet releases are source releases. At this point in time, there exist 0 compliant binary releases.
It's very important that we don't introduce regressions that prevent users from building MXNet.

I didn't check if this is present in 1.7, but if it is, it certainly is a release blocker in my opinion. Note that this is probably a regression due to the work on MXNet 2. It's not acceptable to introduce such regressions in the 1.x series.

leezu (Contributor) commented Jun 11, 2020

I measured the overall memory consumption during compilation using the Linux control group feature: https://github.com/gsauthof/cgmemtime

Results are

v1.7.x
Child user: 7658.352 s
Child sys : 263.657 s
Child wall: 199.661 s
Child high-water RSS : 1952024 KiB
Recursive and acc. high-water RSS+CACHE : 54680084 KiB

v1.6.x
Child user: 5758.186 s
Child sys : 222.487 s
Child wall: 131.241 s
Child high-water RSS : 2040712 KiB
Recursive and acc. high-water RSS+CACHE : 45344596 KiB

v1.5.x
Child user: 3800.705 s
Child sys : 143.353 s
Child wall: 112.121 s
Child high-water RSS : 1604820 KiB
Recursive and acc. high-water RSS+CACHE : 37374300 KiB

ccache is always cleaned between compilations. Results obtained with:

CC=gcc-7 CXX=g++-7 cmake -GNinja -DUSE_CUDA=0 ..
cgmemtime ninja

This is preliminary in that it measures parallel compilation, so memory usage is very high. Overall there is a 44% increase compared to 1.5.
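
The same cgmemtime approach can also be narrowed down to a single object file; the sketch below assumes an already-configured ninja build directory, and the exact target name for indexing_op may differ between versions, so it is queried from ninja first.

cd build
# Ask ninja for the exact target name of the object file (the name below is an example).
ninja -t targets all | grep indexing_op
# Build only that target, single-threaded, under cgmemtime.
cgmemtime ninja -j1 CMakeFiles/mxnet.dir/src/operator/tensor/indexing_op.cc.o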

leezu (Contributor) commented Jun 12, 2020

Doing a single-process build of the 1.7.x branch (ninja -j1) costs only around 2 GB of memory at maximum.

Child user: 4167.479 s
Child sys : 159.497 s
Child wall: 4327.964 s
Child high-water RSS : 1952008 KiB
Recursive and acc. high-water RSS+CACHE : 2155568 KiB

wkcn (Member, Author) commented Jun 12, 2020

I'm trying to use ninja to build MXNet 2.0 (the master branch) on my laptop (16 GB memory + 8 GB virtual memory). I will update the log later.

cmake  -GNinja -DUSE_CUDA=0 ..
cgmemtime ninja

I ran ninja twice since the build was interrupted; the second run continued the build.
(gcc 10.1.0, i7-7500U (2 cores / 4 threads), MXNet master, commit 1bf881f)

Child user: 3692.505 s
Child sys :  177.550 s
Child wall: 1017.096 s
Child high-water RSS                    :    1852208 KiB
Recursive and acc. high-water RSS+CACHE :    3877980 KiB

Child user: 13315.378 s
Child sys :  353.862 s
Child wall: 3847.364 s
Child high-water RSS                    :   11402844 KiB
Recursive and acc. high-water RSS+CACHE :   12226040 KiB

leezu (Contributor) commented Jun 12, 2020

Thanks @wkcn. I'll report the same with gcc7. You are using gcc10 right?

wkcn (Member, Author) commented Jun 12, 2020

@leezu
Yes, gcc 10.1.0, i7-7500u (2 cores 4 threads), MXNet(master, 1bf881f)

leezu (Contributor) commented Jun 12, 2020

A single-process build of MXNet master with gcc7 gives the following results:

Child user: 5288.372 s
Child sys :  188.645 s
Child wall: 5481.062 s
Child high-water RSS                    :    2504976 KiB
Recursive and acc. high-water RSS+CACHE :    2674692 KiB

That's a 24% increase compared to 1.7, but less than 3 GB high-water. So I don't think we have a blocking issue here. @wkcn I suggest you reduce the number of parallel build jobs to stay under 16 GB. I also recommend using ccache to avoid rebuilding.
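
A sketch of both suggestions combined, limiting parallel compile jobs and routing compilation through ccache (CMAKE_C_COMPILER_LAUNCHER/CMAKE_CXX_COMPILER_LAUNCHER require ccache to be installed; the job count is an example and should be tuned to the available memory):

# Route compilations through ccache and cap ninja at 2 parallel jobs to keep peak memory low.
CC=gcc-7 CXX=g++-7 cmake -GNinja -DUSE_CUDA=0 -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache ..
ninja -j2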

leezu changed the title from "It costs over 16GB memory to compile indexing_op.o" to "It costs ~3GB memory to compile indexing_op.o" on Jun 12, 2020

wkcn (Member, Author) commented Jun 13, 2020

Hi @leezu, I found the cause.
The memory cost depends on the compiler and the build method (ninja or make).
I built indexing_op.o with ninja using different versions of gcc:

Compiler      Memory cost (Child high-water RSS)
g++ 6.4.1     1.95 GB
g++ 7.4.1     1.78 GB
g++ 10.1.0    11 GB

Besides, the compiler flags differ between build methods (for example, the Makefile enables -funroll-loops, which takes more memory), so the memory cost differs as well.

wkcn (Member, Author) commented Jun 13, 2020

Hi @leezu @woreom @ciyongch, I have found the cause.

The cause is related to the compiler. g++ 10 takes over 11 GB of memory to build indexing_op.o, but g++ 6 and 7 take less than 2 GB.

The solution is to build MXNet with g++-6 or g++-7.

Thanks for your help!

wkcn closed this as completed Jun 15, 2020
leezu changed the title from "It costs ~3GB memory to compile indexing_op.o" to "gcc10 memory usage regression for compiling indexing_op.o" on Jun 15, 2020
leezu (Contributor) commented Jun 15, 2020

@wkcn thank you for investigating this. The regression in gcc is quite serious. Would you check if there is a report at https://gcc.gnu.org/bugs/ and potentially open a new bug report? Eventually gcc10 will be shipped by default on many platforms and this issue may affect more users later.

leezu reopened this Jun 15, 2020
wkcn (Member, Author) commented Jun 15, 2020

@leezu Sorry, I do not know how to find the bug report at https://gcc.gnu.org/bugs/

leezu (Contributor) commented Jun 15, 2020

@wkcn the bugtracker is linked on the page. It's https://gcc.gnu.org/bugzilla/

wkcn (Member, Author) commented Jun 16, 2020

@leezu Thank you! I guess that the bug is a memory leak in the compiler gcc 10.1.0.

leezu changed the title from "gcc10 memory usage regression for compiling indexing_op.o" to "gcc8+ memory usage regression for compiling indexing_op.o" on Jun 24, 2020
leezu (Contributor) commented Jun 24, 2020

According to #15393 (comment), the leak already occurs with gcc8.
