[BUG] Running problem with conda installed deepMD toolkit #1446

Closed
halohyx opened this issue Jan 24, 2022 · 2 comments

halohyx commented Jan 24, 2022

Dear DeepMD developers,
I installed DeePMD-kit on a server following the official easy-install instructions at https://github.com/deepmodeling/deepmd-kit/blob/master/doc/install/easy-install.md#with-conda, using the following command:
conda create -n deepmd_tst deepmd-kit=2.0.0=*gpu libdeepmd=2.0.0=*gpu lammps-dp cudatoolkit=10.1 horovod -c https://conda.deepmodeling.org

I then ran "dp -h", and the output suggests that DeePMD-kit was installed correctly:
2022-01-24 20:17:04.096599: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd/lib/python3.9/importlib/__init__.py:169: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
_bootstrap._exec(spec, module)
usage: dp [-h] [--version] {config,transfer,train,freeze,test,compress,doc-train-input,model-devi,convert-from} ...

DeePMD-kit: A deep learning package for many-body potential energy representation and molecular dynamics

optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit

Valid subcommands:
{config,transfer,train,freeze,test,compress,doc-train-input,model-devi,convert-from}
config fast configuration of parameter file for smooth model
transfer pass parameters to another model
train train a model
freeze freeze the model
test test the model
compress compress a model
doc-train-input print the documentation (in rst format) of input training parameters.
model-devi calculate model deviation
convert-from convert lower model version to supported version

However, when I ran the official water example with "dp train water.json", I got the following error:
2022-01-24 19:49:52.461464: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/importlib/__init__.py:169: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
_bootstrap._exec(spec, module)
/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/deepmd/common.py:334: UserWarning: the key n_neuron is deprecated, please use fitting_neuron instead
warnings.warn(f"the key {ii} is deprecated, please use {key} instead")
/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/deepmd/utils/compat.py:50: UserWarning: It seems that you are using a deepmd-kit input of version 0.x.x, which is deprecated. we have converted the input to >2.0.0 compatible
warnings.warn(msg)
2022-01-24 19:50:04.562682: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-24 19:50:04.566883: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-24 19:50:04.881041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:b7:00.0 name: Tesla V100-SXM3-32GB computeCapability: 7.0
coreClock: 1.597GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 913.62GiB/s
2022-01-24 19:50:04.881215: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
2022-01-24 19:50:04.888855: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2022-01-24 19:50:04.888972: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2022-01-24 19:50:04.894870: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-01-24 19:50:04.896524: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-01-24 19:50:04.901345: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2022-01-24 19:50:04.903826: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2022-01-24 19:50:04.925130: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.7
2022-01-24 19:50:04.937177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2022-01-24 19:50:04.937262: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
2022-01-24 19:50:07.387275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-01-24 19:50:07.387406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2022-01-24 19:50:07.387430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2022-01-24 19:50:07.416026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9774 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM3-32GB, pci bus id: 0000:b7:00.0, compute capability: 7.0)
2022-01-24 19:50:07.416709: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2022-01-24 19:50:07.433709: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2700000000 Hz
OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 24,25,30,72,73,78
OMP: Info #216: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #157: KMP_AFFINITY: 6 available OS procs
OMP: Info #158: KMP_AFFINITY: Uniform topology
OMP: Info #287: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #287: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
OMP: Info #287: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
OMP: Info #287: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #192: KMP_AFFINITY: 1 socket x 3 cores/socket x 2 threads/core (3 total cores)
OMP: Info #218: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #172: KMP_AFFINITY: OS proc 24 maps to socket 1 core 0 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 72 maps to socket 1 core 0 thread 1
OMP: Info #172: KMP_AFFINITY: OS proc 25 maps to socket 1 core 1 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 73 maps to socket 1 core 1 thread 1
OMP: Info #172: KMP_AFFINITY: OS proc 30 maps to socket 1 core 8 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 78 maps to socket 1 core 8 thread 1
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248784 thread 1 bound to OS proc set 25
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248787 thread 2 bound to OS proc set 30
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248788 thread 3 bound to OS proc set 72
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248789 thread 4 bound to OS proc set 73
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248790 thread 5 bound to OS proc set 78
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248791 thread 6 bound to OS proc set 24
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248785 thread 7 bound to OS proc set 25
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248792 thread 8 bound to OS proc set 30
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248793 thread 9 bound to OS proc set 72
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248794 thread 10 bound to OS proc set 73
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248795 thread 11 bound to OS proc set 78
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248796 thread 12 bound to OS proc set 24
DEEPMD INFO training data with min nbor dist: 0.8763010118574123
DEEPMD INFO training data with max nbor size: [38, 72]
Traceback (most recent call last):
File "/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/bin/dp", line 10, in <module>
sys.exit(main())
File "/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
train_dp(**dict_args)
File "/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 91, in train
jdata = update_sel(jdata)
File "/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 341, in update_sel
descrpt_data = update_one_sel(jdata, descrpt_data)
File "/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 318, in update_one_sel
if parse_auto_sel(descriptor['sel']) :
KeyError: 'sel'

I did not change anything after the installation. I also tried installing other versions by changing the version specifications in the conda command, but the same error appeared. Could you suggest how to solve it?
Thank you very much for your time!

@halohyx halohyx added the bug label Jan 24, 2022
Member

njzjz commented Jan 24, 2022

Hi, this bug has been fixed in #1253. For this version, I suggest not using the local frame descriptor and using se_e2_a instead. See https://docs.deepmodeling.org/projects/deepmd/en/v2.0.0/model/overall.html for details.
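
A minimal sketch of that change, assuming the training input is a JSON document with the descriptor stored under model.descriptor. The helper names (descriptor_has_sel, use_se_e2_a) are hypothetical and not part of DeePMD-kit. The numeric values are illustrative placeholders only: sel is simply chosen to be larger than the maximum neighbor sizes [38, 72] reported in the log, and the other settings should be taken from the linked documentation. The first check mirrors the lookup in the traceback, which fails when the descriptor section has no "sel" key.

```python
import copy

# Hypothetical helper (not part of DeePMD-kit): reports whether the
# descriptor section defines "sel", the key whose absence raised the
# KeyError in the traceback above.
def descriptor_has_sel(jdata):
    descriptor = jdata.get("model", {}).get("descriptor", {})
    return "sel" in descriptor

# Illustrative se_e2_a descriptor. All numeric values are assumptions
# and should be adapted to the system being trained.
SE_E2_A_DESCRIPTOR = {
    "type": "se_e2_a",
    "sel": [46, 92],        # assumed; larger than the logged max nbor size [38, 72]
    "rcut_smth": 0.5,
    "rcut": 6.0,
    "neuron": [25, 50, 100],
    "axis_neuron": 16,
}

# Hypothetical helper: returns a copy of the input with the descriptor
# section replaced, leaving the original input untouched.
def use_se_e2_a(jdata):
    updated = copy.deepcopy(jdata)
    updated.setdefault("model", {})["descriptor"] = copy.deepcopy(SE_E2_A_DESCRIPTOR)
    return updated

# A stand-in for an input that uses the local frame descriptor.
old_input = {"model": {"descriptor": {"type": "loc_frame", "rcut": 6.0}}}
new_input = use_se_e2_a(old_input)

print(descriptor_has_sel(old_input))  # False
print(descriptor_has_sel(new_input))  # True
```

In practice the dictionary would be loaded from and written back to the input file with json.load and json.dump before running "dp train" again.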

Author

halohyx commented Jan 25, 2022

Thanks a lot! After I changed the descriptor to se_e2_a, the problem was fixed. Thanks again for your quick reply and your valuable time!

@halohyx halohyx closed this as completed Jan 25, 2022