[BUG] Nbor list sorting error in lammps with the compressed model #773

Closed
zezhong-zhang opened this issue Jun 19, 2021 · 4 comments

zezhong-zhang commented Jun 19, 2021

Summary

Using the compressed model in LAMMPS with multiple GPUs leads to an "illegal nbor list sorting" error. A single GPU does not have this issue.

DeePMD-kit version, installation method, input files, running commands, error log, etc.
System: CentOS Linux 7 (Core) with slurm
deepmd-kit: 2.0.0.b0 py39_0_cuda10.1_gpu deepmodeling/label/dev
lammps-dp: 2.0.0.b0 0_cuda10.1_gpu deepmodeling/label/dev
python: 3.9.4 hdb3f193_0
installation: conda 4.10.1
command: srun -n 16 lmp -in in.lammps
Input and output files, including:
in.lammps
graph.pb (model not compressed)
graph-compress.pb (after compression)
log for single GPU
log for multiple GPU with srun
log for multiple GPU with mpirun
the model training parameters
g6_sub.lammps -- this is a small test structure
hex_loop_2_new.lammps -- this is a large structure

Archive.zip

Steps to Reproduce

  1. srun -n 16 lmp -in in.lammps with the compressed model fails with "illegal nbor list sorting"; mpirun fails the same way.
  2. lmp -in in.lammps with the compressed model on a single GPU runs.
  3. srun -n 16 lmp -in in.lammps with the uncompressed model on multiple GPUs also runs.
  4. In all cases, however, the output (both MC and MD) does not update in the log, although the dump is still written.

Further Information, Files, and Links
For the large structure, I have 58673 atoms in the box and run on 16 V100 GPUs. Running with the uncompressed model gives a CUDA out-of-memory error. What would be a good estimate for the number of atoms per GPU?

@zezhong-zhang zezhong-zhang changed the title from "[BUG] Sorting error in lammps with the compressed model" to "[BUG] Nbor list sorting error in lammps with the compressed model" Jun 19, 2021
Member

denghuilu commented Jun 20, 2021

Could you provide the training data mentioned in the input.json? We want to try different model compression parameters.

Member

denghuilu commented Jun 21, 2021

Actually, a similar problem can also be found in the original model. As a quick fix, we suggest setting nlist freq to 1, which works around the problem in the original model. We are working on a fix for the compressed model and will provide it as soon as possible.
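
For readers applying this workaround, here is a minimal sketch of what it might look like in in.lammps. It assumes that "nlist freq" refers to the neighbor-list rebuild interval controlled by the standard LAMMPS neigh_modify command. The comment above does not name the command, so treat this mapping as an assumption rather than the maintainers' exact instruction.

```lammps
# Assumption: "nlist freq" means the neighbor-list rebuild interval.
# Rebuild the neighbor list on every timestep:
#   every 1  -> consider a rebuild every step
#   delay 0  -> do not postpone rebuilds after the previous one
#   check no -> rebuild unconditionally, without the displacement check
neigh_modify every 1 delay 0 check no
```

Rebuilding on every step costs more per step than a longer interval, so this is best treated as a temporary setting until the fix is available.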


dfz05 commented Jun 21, 2021

I hit the same problem during MC simulation with the uncompressed model. Running MD alone works. However, when running MC+MD, LAMMPS fails at random and reports the error "illegal nbor list sorting".
The version I used is DeePMD-kit standalone 2.0.0.beta0.

Member

njzjz commented Jun 30, 2021

Fixed in #812.

@njzjz njzjz closed this as completed Jun 30, 2021
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Sep 21, 2023
Since there are lots of parameters still not added (deepmodeling#771, deepmodeling#773, deepmodeling#774,
deepmodeling#781, deepmodeling#782), we only enable loose check for `dpgen run` (i.e.
`strict_check=False`).