Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix python cmake error #976

Merged
merged 1 commit into from
Aug 17, 2021
Merged

Conversation

denghuilu
Copy link
Member

@denghuilu denghuilu commented Aug 16, 2021

In the local deepmd-kit python interface compilation, an error occurs with export DP_VARIANT=cuda:

root deepmd-kit $ export DP_VARIANT=cuda
root deepmd-kit $ 
root deepmd-kit $ pip install .
Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/
Processing /root/dp-devel/deepmd-kit
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Requirement already satisfied: numpy in /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages (from deepmd-kit==2.0.0b5.dev23+g68577be) (1.19.5)
Requirement already satisfied: typing-extensions in /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages (from deepmd-kit==2.0.0b5.dev23+g68577be) (3.7.4.3)
Requirement already satisfied: dargs>=0.2.6 in /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages (from deepmd-kit==2.0.0b5.dev23+g68577be) (0.2.6)
Requirement already satisfied: python-hostlist>=1.21 in /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages (from deepmd-kit==2.0.0b5.dev23+g68577be) (1.21)
Requirement already satisfied: pyyaml in /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages (from deepmd-kit==2.0.0b5.dev23+g68577be) (5.4.1)
Requirement already satisfied: scipy in /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages (from deepmd-kit==2.0.0b5.dev23+g68577be) (1.5.4)
Building wheels for collected packages: deepmd-kit
  Building wheel for deepmd-kit (PEP 517) ... done
  Created wheel for deepmd-kit: filename=deepmd_kit-2.0.0b5.dev23+g68577be-cp36-cp36m-linux_x86_64.whl size=1609118 sha256=9ce3b0c24bfc8bef6712030585f692171da6a1c847df387966a5b77df700f30f
  Stored in directory: /root/.cache/pip/wheels/d7/0d/2a/b6f1653be91ad8dd465c1560d36da90e510c5d488bc4fd6563
Successfully built deepmd-kit
Installing collected packages: deepmd-kit
  Attempting uninstall: deepmd-kit
    Found existing installation: deepmd-kit 2.0.0b5.dev23+g68577be.d20210816
    Uninstalling deepmd-kit-2.0.0b5.dev23+g68577be.d20210816:
      Successfully uninstalled deepmd-kit-2.0.0b5.dev23+g68577be.d20210816
Successfully installed deepmd-kit-2.0.0b5.dev23+g68577be
root deepmd-kit $ 
root deepmd-kit $ dp -h
WARNING:tensorflow:From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Traceback (most recent call last):
  File "/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/deepmd/env.py", line 176, in get_module
    module = tf.load_op_library(str(module_file))
  File "/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 57, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/deepmd/op/libop_abi.so: undefined symbol: _ZN6deepmd23prod_env_mat_a_gpu_cudaIfEEvPT_S2_S2_PiPKS1_PKiRKNS_10InputNlistES3_PyiS5_S5_iiffSt6vectorIiSaIiEE

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/dp-devel/tensorflow_venv/bin/dp", line 5, in <module>
    from deepmd.entrypoints.main import main
  File "/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/deepmd/__init__.py", line 3, in <module>
    import deepmd.utils.network as network
  File "/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/deepmd/utils/__init__.py", line 2, in <module>
    from .data import DeepmdData
  File "/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/deepmd/utils/data.py", line 10, in <module>
    from deepmd.env import GLOBAL_NP_FLOAT_PRECISION
  File "/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/deepmd/env.py", line 253, in <module>
    op_module = get_module("libop_abi")
  File "/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/deepmd/env.py", line 224, in get_module
    )) from e
RuntimeError: This deepmd-kit package is inconsitent with TensorFlowRuntime, thus an error is raised when loading libop_abi.You need to rebuild deepmd-kit against this TensorFlowruntime.

This problem is caused by the incorrect linking of LIB_DEEPMD_OP_DEVICE within the $deepmd_source_dir/source/op/CMakeLists.txt

Change explanation:
The usage requirements of a target can transitively propagate to dependents. The target_link_libraries() command has PRIVATE, INTERFACE and PUBLIC keywords to control the propagation.

Generally, a dependency should be specified in a use of target_link_libraries() with the PRIVATE keyword if it is used by only the implementation of a library, and not in the header files. If a dependency is additionally used in the header files of a library (e.g. for class inheritance), then it should be specified as a PUBLIC dependency. A dependency which is not used by the implementation of a library, but only by its headers should be specified as an INTERFACE dependency.

When compile with the cuda/rocm environment, the shared library LIB_DEEPMD will link the LIB_DEEPMD_OP_DEVICE. However, the LIB_DEEPMD uses the LIB_DEEPMD_OP_DEVICE in it's header files not in it's source files. So an INTERFACE argument is more proper than PRIVATE. And this will solve the above problem.

@denghuilu denghuilu requested review from njzjz and galeselee August 16, 2021 02:19
@codecov-commenter
Copy link

Codecov Report

Merging #976 (dad0be0) into devel (68577be) will decrease coverage by 11.08%.
The diff coverage is n/a.

❗ Current head dad0be0 differs from pull request most recent head a6aeeab. Consider uploading reports for the commit a6aeeab to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##            devel     #976       +/-   ##
===========================================
- Coverage   75.37%   64.28%   -11.09%     
===========================================
  Files          85        5       -80     
  Lines        6801       14     -6787     
===========================================
- Hits         5126        9     -5117     
+ Misses       1675        5     -1670     
Impacted Files Coverage Δ
deepmd/utils/type_embed.py
deepmd/infer/deep_dipole.py
deepmd/utils/__init__.py
deepmd/fit/polar.py
deepmd/descriptor/__init__.py
source/op/_prod_force_se_r_grad.py
deepmd/utils/data_system.py
deepmd/train/trainer.py
deepmd/loggers/__init__.py
deepmd/utils/tabulate.py
... and 70 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68577be...a6aeeab. Read the comment docs.

Copy link
Member

@njzjz njzjz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right in theory, but sadly, I cannot reproduce the error...

@denghuilu
Copy link
Member Author

I think you are right in theory, but sadly, I cannot reproduce the error...

I can get this error in both Aliyun and my local workstation ( •́ω•̩̥̀ )

@njzjz
Copy link
Member

njzjz commented Aug 16, 2021

I add -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON to see the detailed command. On the devel branch, my link command generated by CMake is

&& /usr/local/bin/g++750 -fPIC -std=c++11 -Wno-ignored-attributes -fopenmp -O3 -DNDEBUG   -shared -Wl,-soname,libop_abi.so -o op/libop_abi.so op/CMakeFiles/op_abi.dir/custom_op.cc.o op/CMakeFiles/op_abi.dir/descrpt.cc.o op/CMakeFiles/op_abi.dir/descrpt_se_a_ef.cc.o op/CMakeFiles/op_abi.dir/descrpt_se_a_ef_para.cc.o op/CMakeFiles/op_abi.dir/descrpt_se_a_ef_vert.cc.o op/CMakeFiles/op_abi.dir/ewald_recp.cc.o op/CMakeFiles/op_abi.dir/gelu_multi_device.cc.o op/CMakeFiles/op_abi.dir/map_aparam.cc.o op/CMakeFiles/op_abi.dir/neighbor_stat.cc.o op/CMakeFiles/op_abi.dir/pair_tab.cc.o op/CMakeFiles/op_abi.dir/prod_env_mat_multi_device.cc.o op/CMakeFiles/op_abi.dir/prod_force.cc.o op/CMakeFiles/op_abi.dir/prod_force_multi_device.cc.o op/CMakeFiles/op_abi.dir/prod_virial.cc.o op/CMakeFiles/op_abi.dir/prod_virial_multi_device.cc.o op/CMakeFiles/op_abi.dir/soft_min.cc.o op/CMakeFiles/op_abi.dir/soft_min_force.cc.o op/CMakeFiles/op_abi.dir/soft_min_virial.cc.o op/CMakeFiles/op_abi.dir/tabulate_multi_device.cc.o op/CMakeFiles/op_abi.dir/unaggregated_grad.cc.o op/CMakeFiles/op_abi.dir/__/lib/src/SimulationRegion.cpp.o op/CMakeFiles/op_abi.dir/__/lib/src/neighbor_list.cc.o  -Wl,-rpath,/home/jz748/codes/deepmd-kit/_skbuild/linux-x86_64-3.8/cmake-build/lib:/home/jz748/anaconda3/envs/dpdev/lib/python3.8/site-packages/tensorflow:/home/jz748/codes/deepmd-kit/_skbuild/linux-x86_64-3.8/cmake-build/lib/src/cuda:  lib/libdeepmd.so  /home/jz748/anaconda3/envs/dpdev/lib/python3.8/site-packages/tensorflow/libtensorflow_framework.so.2  -Wl,-rpath-link,/home/jz748/codes/deepmd-kit/_skbuild/linux-x86_64-3.8/cmake-build/lib/src/cuda && :

I think the key point is -Wl,-rpath-link,/home/jz748/codes/deepmd-kit/_skbuild/linux-x86_64-3.8/cmake-build/lib/src/cuda, where -Wl,-rpath-link means

When using ELF or SunOS, one shared library may require another. This happens when an ld -shared link includes a shared library as one of the input files. When the linker encounters such a dependency when doing a non-shared, non-relocateable link, it will automatically try to locate the required shared library and include it in the link, if it is not included explicitly. In such a case, the -rpath-link option specifies the first set of directories to search. The -rpath-link option may specify a sequence of directory names either by specifying a list of names separated by colons, or by appearing multiple times. The linker uses the following search paths to locate required shared libraries.

(https://ftp.gnu.org/old-gnu/Manuals/ld-2.9.1/html_chapter/ld_2.html)

So, what's your command?

@denghuilu
Copy link
Member Author

Before this PR, the command would be:

/usr/bin/c++ -fPIC  -std=c++11 -Wno-ignored-attributes -fopenmp -O3 -DNDEBUG  -shared -Wl,-soname,libop_abi.so -o libop_abi.so CMakeFiles/op_abi.dir/custom_op.cc.o CMakeFiles/op_abi.dir/prod_force.cc.o CMakeFiles/op_abi.dir/prod_virial.cc.o CMakeFiles/op_abi.dir/descrpt.cc.o CMakeFiles/op_abi.dir/descrpt_se_a_ef.cc.o CMakeFiles/op_abi.dir/descrpt_se_a_ef_para.cc.o CMakeFiles/op_abi.dir/descrpt_se_a_ef_vert.cc.o CMakeFiles/op_abi.dir/pair_tab.cc.o CMakeFiles/op_abi.dir/prod_force_multi_device.cc.o CMakeFiles/op_abi.dir/prod_virial_multi_device.cc.o CMakeFiles/op_abi.dir/soft_min.cc.o CMakeFiles/op_abi.dir/soft_min_force.cc.o CMakeFiles/op_abi.dir/soft_min_virial.cc.o CMakeFiles/op_abi.dir/ewald_recp.cc.o CMakeFiles/op_abi.dir/gelu_multi_device.cc.o CMakeFiles/op_abi.dir/map_aparam.cc.o CMakeFiles/op_abi.dir/neighbor_stat.cc.o CMakeFiles/op_abi.dir/unaggregated_grad.cc.o CMakeFiles/op_abi.dir/tabulate_multi_device.cc.o CMakeFiles/op_abi.dir/prod_env_mat_multi_device.cc.o CMakeFiles/op_abi.dir/__/lib/src/SimulationRegion.cpp.o CMakeFiles/op_abi.dir/__/lib/src/neighbor_list.cc.o -Wl,-rpath,/root/dp-devel/deepmd-kit/_skbuild/linux-x86_64-3.6/cmake-build/lib:/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow: ../lib/libdeepmd.so /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so.2

After this PR, the command would be:

/usr/bin/c++ -fPIC  -std=c++11 -Wno-ignored-attributes -fopenmp -O3 -DNDEBUG  -shared -Wl,-soname,libop_abi.so -o libop_abi.so CMakeFiles/op_abi.dir/custom_op.cc.o CMakeFiles/op_abi.dir/prod_force.cc.o CMakeFiles/op_abi.dir/prod_virial.cc.o CMakeFiles/op_abi.dir/descrpt.cc.o CMakeFiles/op_abi.dir/descrpt_se_a_ef.cc.o CMakeFiles/op_abi.dir/descrpt_se_a_ef_para.cc.o CMakeFiles/op_abi.dir/descrpt_se_a_ef_vert.cc.o CMakeFiles/op_abi.dir/pair_tab.cc.o CMakeFiles/op_abi.dir/prod_force_multi_device.cc.o CMakeFiles/op_abi.dir/prod_virial_multi_device.cc.o CMakeFiles/op_abi.dir/soft_min.cc.o CMakeFiles/op_abi.dir/soft_min_force.cc.o CMakeFiles/op_abi.dir/soft_min_virial.cc.o CMakeFiles/op_abi.dir/ewald_recp.cc.o CMakeFiles/op_abi.dir/gelu_multi_device.cc.o CMakeFiles/op_abi.dir/map_aparam.cc.o CMakeFiles/op_abi.dir/neighbor_stat.cc.o CMakeFiles/op_abi.dir/unaggregated_grad.cc.o CMakeFiles/op_abi.dir/tabulate_multi_device.cc.o CMakeFiles/op_abi.dir/prod_env_mat_multi_device.cc.o CMakeFiles/op_abi.dir/__/lib/src/SimulationRegion.cpp.o CMakeFiles/op_abi.dir/__/lib/src/neighbor_list.cc.o -Wl,-rpath,/root/dp-devel/deepmd-kit/_skbuild/linux-x86_64-3.6/cmake-build/lib:/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow:/root/dp-devel/deepmd-kit/_skbuild/linux-x86_64-3.6/cmake-build/lib/src/cuda: ../lib/libdeepmd.so /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so.2 ../lib/src/cuda/libdeepmd_op_cuda.so /usr/local/cuda-11.0/lib64/libcudart_static.a -lpthread -ldl /usr/lib/x86_64-linux-gnu/librt.so

The difference is whether the libdeepmd_op_cuda.so was used.

@njzjz
Copy link
Member

njzjz commented Aug 16, 2021

Maybe it's a changed behavior in cmake...

@denghuilu
Copy link
Member Author

Actually, it's about the ninja-build package. When the Ninja package is not installed in the system, the "Ninja" generator of cmake will fail. At this time, the "Unix Makefiles" generator will be enabled by default; And the two generators have inconsistent behaviors for the PRIVATE parameter in the target_link_library function.

@amcadmus amcadmus merged commit a1914ea into deepmodeling:devel Aug 17, 2021
gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this pull request Sep 2, 2021
njzjz added a commit to njzjz/deepmd-kit that referenced this pull request May 13, 2022
In deepmodeling#976, propagate was changed from PRIVATE to PUBLIC. However, `libdeepmd` actually includes the CUDA libraries in the `gelu.cc` file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants