
Surprising differences between nnpops and torchani #82

Closed
wiederm opened this issue Feb 12, 2023 · 17 comments · Fixed by #83
Labels
help wanted (Extra attention is needed)

Comments

wiederm commented Feb 12, 2023

Hi,

I have been getting surprising results running waterbox simulations with the torchani vs. the nnpops implementation of ani2x. I used the openmmtools WaterBox test system with an edge length of 20 Å, a 1 fs timestep, and simulated for 1 ns using a Langevin integrator with a 1/ps collision rate. The system was set up with potential.createSystem().

When I run simulations in NpT with the torchani implementation at 300 K, everything looks relatively normal (the density is a bit too high, and the RDF shows some surprising features, though):
[image]
When I perform the same simulation with the nnpops implementation I see this:
[image]
and the simulation box has shrunk (the initial box size is the yellow-outlined square). Also note the difference in the y-axis scale for the potential energy:
[image]

In NVT I observe vacuum bubbles with nnpops:
https://user-images.githubusercontent.com/31651017/218335894-0254ed80-e51f-4189-9bfc-ae94637cfd85.mp4

compared to the same simulation with torchani:
https://user-images.githubusercontent.com/31651017/218335817-3e911757-d19d-4f71-b922-8b9de913237e.mp4

I attached a minimal example to reproduce the simulations.

min_example.py.zip
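
For reference, a minimal sketch of the NpT setup described above (the attached min_example.py is the authoritative version; the 1 atm barostat pressure and 25-step barostat interval here are assumptions, not stated in the post):

from openmm import unit, LangevinIntegrator, MonteCarloBarostat, Platform
from openmm.app import Simulation
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox

# 20 A water box with ANI-2x via openmm-ml; switch implementation between
# 'torchani' and 'nnpops' to compare the two code paths
waterbox = WaterBox(20 * unit.angstrom, constrained=False)
potential = MLPotential('ani2x')
system = potential.createSystem(waterbox.topology, implementation='nnpops',
                                removeConstraints=True)

# NpT: assumed barostat settings (pressure and interval are not given in the post)
system.addForce(MonteCarloBarostat(1 * unit.atmosphere, 300 * unit.kelvin, 25))

integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond,
                                1 * unit.femtosecond)
simulation = Simulation(waterbox.topology, system, integrator,
                        platform=Platform.getPlatformByName('CUDA'),
                        platformProperties={'Precision': 'mixed'})
simulation.context.setPositions(waterbox.positions)
simulation.step(1_000_000)  # 1 ns at a 1 fs timestep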

wiederm (Author) commented Feb 13, 2023

The relevant packages in my environment are:

openmm                    8.0.0           py310h5728c26_0    conda-forge
openmm-torch              1.0             cuda112py310hdb05021_0    conda-forge
openmmml                  1.0                      pypi_0    pypi
openmmtools               0.21.4             pyhd8ed1ab_0    conda-forge
nnpops                    0.2             cuda112py310h85a0d14_4    conda-forge

raimis added the help wanted label Feb 13, 2023
raimis (Contributor) commented Feb 13, 2023

@RaulPPelaez could you try to reproduce and verify the issue?

RaulPPelaez (Contributor) commented:

Out of curiosity, what are you using for visualization?

RaulPPelaez (Contributor) commented:

Could you try running with nnpops=0.3?

wiederm (Author) commented Feb 14, 2023

I am using nglview in combination with ipywidgets.

I have started simulations with nnpops=0.3 and will update with the results in a few hours.

wiederm (Author) commented Feb 14, 2023

Just looking at the first few ps of simulation time, I see qualitatively the same behavior in both NpT and NVT with ani2x and nnpops=0.3:
[image]

Relevant packages:

openmm                    8.0.0           py310h5728c26_0    conda-forge
openmm-torch              1.0             cuda112py310hbd91edb_0    conda-forge
openmmml                  1.0                      pypi_0    pypi
openmmtools               0.21.5             pyhd8ed1ab_0    conda-forge
nnpops                    0.3             cuda112py310hd4d1af5_2    conda-forge

wiederm (Author) commented Feb 15, 2023

With the updated nnpops=0.3, simulations with the torchani implementation that previously ran fine now explode (with the nnpops implementation they behave as described above). The following script runs fine with nnpops=0.2, but if I upgrade to nnpops=0.3 the temperature increases significantly and hydrogen-oxygen bonds break.

This is a minimal example to reproduce the issue:

from openmm import unit, LangevinIntegrator, Platform
from openmm.app import Simulation, StateDataReporter
from mdtraj.reporters import HDF5Reporter
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
from openmm.unit import Quantity
from sys import stdout

waterbox = WaterBox(15 * unit.angstrom, constrained=False)
potential = MLPotential('ani2x')
system = potential.createSystem(waterbox.topology, implementation='torchani', removeConstraints=True)
####################
# define simulation parameters
stepsize = Quantity(1., unit.femto * unit.seconds)
collision_rate = 1 / Quantity(1, unit.pico * unit.second)
temperature = Quantity(300, unit.kelvin)

integrator = LangevinIntegrator(
    temperature, collision_rate, stepsize
)

platform = Platform.getPlatformByName('CUDA')
simulation = Simulation(waterbox.topology, system, integrator, platform=platform, platformProperties={'Precision': 'mixed'})
simulation.context.setPositions(waterbox.positions)

simulation.reporters.append(
    HDF5Reporter(
        "./tmp.h5",
        10,
    )
)
simulation.reporters.append(StateDataReporter(stdout, 10, step=True, potentialEnergy=True, temperature=True))
simulation.step(5_000)
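
To quantify the blow-up rather than eyeballing the StateDataReporter output, one could continue the script above with a small check like this (a sketch; the 1000 K threshold is an arbitrary choice):

# Continue from the script above: advance in small chunks and compute the
# instantaneous temperature from the kinetic energy (no constraints here,
# so roughly 3N - 3 degrees of freedom), stopping once it exceeds 1000 K.
ndof = 3 * system.getNumParticles() - 3
for _ in range(500):
    simulation.step(10)
    ke = simulation.context.getState(getEnergy=True).getKineticEnergy()
    temp_now = (2 * ke / (ndof * unit.MOLAR_GAS_CONSTANT_R)).in_units_of(unit.kelvin)
    if temp_now > 1000 * unit.kelvin:
        print(f"Unstable at step {simulation.currentStep}: T = {temp_now}")
        break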

RaulPPelaez (Contributor) commented:

I am able to reproduce this. We believe it might be a problem with periodic boundary conditions in nnpops.
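
One quick way to probe the PBC hypothesis (a sketch, not the test used below): with a correctly periodic potential, shifting every atom and wrapping back into the primary cell must leave the energy unchanged, whereas a potential that ignores the box sees molecules split across the boundary and the energy changes drastically.

from openmm import unit, LangevinIntegrator, Platform
from openmm.app import Simulation
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
import numpy as np

testsystem = WaterBox(15 * unit.angstrom, cutoff=7 * unit.angstrom)
system = MLPotential("ani2x").createSystem(testsystem.topology, implementation="nnpops")
integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 1 * unit.femtosecond)
simulation = Simulation(testsystem.topology, system, integrator,
                        Platform.getPlatformByName("CUDA"), {"Precision": "mixed"})

pos = np.array(testsystem.positions.value_in_unit(unit.nanometer))
box = np.array([v.value_in_unit(unit.nanometer)
                for v in system.getDefaultPeriodicBoxVectors()])
edges = np.diag(box)  # orthorhombic box: edge lengths on the diagonal

simulation.context.setPositions(pos * unit.nanometer)
e0 = simulation.context.getState(getEnergy=True).getPotentialEnergy()

# Shift every atom by half a box length and wrap back into the primary cell;
# a correct periodic implementation must return the same energy.
wrapped = np.mod(pos + 0.5 * edges, edges)
simulation.context.setPositions(wrapped * unit.nanometer)
e1 = simulation.context.getState(getEnergy=True).getPotentialEnergy()
print(e0, e1)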

RaulPPelaez (Contributor) commented:

I compared the norms of the forces provided by both implementations using this code:

from openmm.app import Simulation
from openmm import unit, LangevinIntegrator, Platform
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
import numpy as np

box_edge = 15 * unit.angstrom
testsystem = WaterBox(box_edge, cutoff=7 * unit.angstrom)
potential = MLPotential("ani2x")
platform = Platform.getPlatformByName("CUDA")
prop = dict(CudaPrecision="mixed")
forces={}
positions=[]
for s in ("nnpops","torchani"):
    system = potential.createSystem(
        testsystem.topology, implementation=s
    )
    integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 0 * unit.picoseconds)
    simulation=Simulation(testsystem.topology, system, integrator, platform, prop)
    simulation.context.setPositions(testsystem.positions)
    forces[s] = simulation.context.getState(getForces=True).getForces().value_in_unit(unit.kilojoules/unit.mole/unit.nanometer)
    positions = simulation.context.getState(getPositions=True, enforcePeriodicBox=True).getPositions().value_in_unit(unit.nanometer)

fnorms_nnpops=np.linalg.norm(forces["nnpops"],axis=1)
fnorms_torchani=np.linalg.norm(forces["torchani"],axis=1)
with open("fnorms.dat", 'w') as f:
    for p,i,j in zip(positions, fnorms_nnpops, fnorms_torchani):
        f.write(f"{p.x} {p.y} {p.z} {i} {j}\n")

error = np.abs((fnorms_nnpops - fnorms_torchani)/fnorms_torchani)

Then I take all the particles whose height is near the center of the domain (z ∈ [0.55, 0.95] nm) and plot a heatmap of the relative error in the force norm:
[image]

This could be consistent with a bug in the PBC handling.
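
For anyone who wants to reproduce this kind of plot from the fnorms.dat file written above, something along these lines should work (a sketch; the original figure may have been produced with a different tool):

import numpy as np
import matplotlib.pyplot as plt

# columns written above: x y z |F_nnpops| |F_torchani|
x, y, z, f_nnpops, f_torchani = np.loadtxt("fnorms.dat", unpack=True)
rel_err = np.abs(f_nnpops - f_torchani) / f_torchani

# keep only the particles in a slab around the middle of the box
mask = (z > 0.55) & (z < 0.95)
sc = plt.scatter(x[mask], y[mask], c=rel_err[mask])
plt.colorbar(sc, label="relative error in |F|")
plt.xlabel("x (nm)")
plt.ylabel("y (nm)")
plt.show()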

RaulPPelaez (Contributor) commented:

I turned off the optimized components one by one here:

# Optimize the components of an ANI model
self.species_converter = TorchANISpeciesConverter(model.species_converter, atomicNumbers)
self.aev_computer = TorchANISymmetryFunctions(model.species_converter, model.aev_computer, atomicNumbers)
self.neural_networks = TorchANIBatchedNN(model.species_converter, model.neural_networks, atomicNumbers)
self.energy_shifter = TorchANIEnergyShifter(model.species_converter, model.energy_shifter, atomicNumbers)

The error arises when using the optimized aev_computer. Replacing that line with:

self.aev_computer = model.aev_computer

results in a relative error around machine precision, as one would expect.
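
For reference, the same kind of bisection can also be done without editing the NNPOps sources, by swapping a component on an already constructed OptimizedTorchANI model (a sketch; the single-water species tensor is just for illustration):

import torch
import torchani
from NNPOps import OptimizedTorchANI

device = torch.device("cuda")
model = torchani.models.ANI2x(periodic_table_index=True).to(device)
species = torch.tensor([[8, 1, 1]], device=device)  # one water molecule, for illustration

optimized = OptimizedTorchANI(model, species).to(device)

# Restore the reference TorchANI AEV computer: if the discrepancy vanishes,
# the optimized symmetry functions are the component introducing the error.
optimized.aev_computer = model.aev_computer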

wiederm (Author) commented Feb 22, 2023

Thanks for tracking this down! I will give it a try!

RaulPPelaez (Contributor) commented:

Many of the CUDA tests in NNPOps fail for me. @peastman, @raimis, could you confirm?

Test project /shared/raul/NNPOps/build
      Start  1: TestCpuANISymmetryFunctions
 1/13 Test  #1: TestCpuANISymmetryFunctions ......   Passed    1.50 sec
      Start  2: TestCpuCFConv
 2/13 Test  #2: TestCpuCFConv ....................   Passed    0.48 sec
      Start  3: TestCudaANISymmetryFunctions
 3/13 Test  #3: TestCudaANISymmetryFunctions .....Subprocess aborted***Exception:   1.80 sec
      Start  4: TestCudaCFConv
 4/13 Test  #4: TestCudaCFConv ...................Subprocess aborted***Exception:   1.82 sec
      Start  5: TestBatchedNN


 5/13 Test  #5: TestBatchedNN ....................***Failed  203.79 sec
      Start  6: TestCFConv
 6/13 Test  #6: TestCFConv .......................   Passed    5.28 sec
      Start  7: TestCFConvNeighbors
 7/13 Test  #7: TestCFConvNeighbors ..............   Passed    3.38 sec
      Start  8: TestEnergyShifter
 8/13 Test  #8: TestEnergyShifter ................   Passed  108.41 sec
      Start  9: TestOptimizedTorchANI
 9/13 Test  #9: TestOptimizedTorchANI ............***Failed  216.59 sec
      Start 10: TestSpeciesConverter
10/13 Test #10: TestSpeciesConverter .............   Passed  111.40 sec
      Start 11: TestSymmetryFunctions
11/13 Test #11: TestSymmetryFunctions ............***Failed  129.94 sec
      Start 12: TestNeighbors
12/13 Test #12: TestNeighbors ....................***Exception: SegFault  5.57 sec
      Start 13: TestGetNeighborPairs
13/13 Test #13: TestGetNeighborPairs .............   Passed    2.47 sec

54% tests passed, 6 tests failed out of 13

Total Test time (real) = 792.56 sec

The following tests FAILED:
          3 - TestCudaANISymmetryFunctions (Subprocess aborted)
          4 - TestCudaCFConv (Subprocess aborted)
          5 - TestBatchedNN (Failed)
          9 - TestOptimizedTorchANI (Failed)
         11 - TestSymmetryFunctions (Failed)
         12 - TestNeighbors (SEGFAULT)

sef43 (Contributor) commented Feb 24, 2023

Mine mostly pass:

      Start  1: TestCpuANISymmetryFunctions
 1/13 Test  #1: TestCpuANISymmetryFunctions ......   Passed    2.22 sec
      Start  2: TestCpuCFConv
 2/13 Test  #2: TestCpuCFConv ....................   Passed    0.87 sec
      Start  3: TestCudaANISymmetryFunctions
 3/13 Test  #3: TestCudaANISymmetryFunctions .....   Passed    2.65 sec
      Start  4: TestCudaCFConv
 4/13 Test  #4: TestCudaCFConv ...................   Passed    3.03 sec
      Start  5: TestBatchedNN
 5/13 Test  #5: TestBatchedNN ....................***Failed  155.24 sec
      Start  6: TestCFConv
 6/13 Test  #6: TestCFConv .......................***Failed    4.81 sec
      Start  7: TestCFConvNeighbors
 7/13 Test  #7: TestCFConvNeighbors ..............   Passed    4.92 sec
      Start  8: TestEnergyShifter
 8/13 Test  #8: TestEnergyShifter ................   Passed   97.31 sec
      Start  9: TestOptimizedTorchANI
 9/13 Test  #9: TestOptimizedTorchANI ............   Passed  163.42 sec
      Start 10: TestSpeciesConverter
10/13 Test #10: TestSpeciesConverter .............   Passed  105.67 sec
      Start 11: TestSymmetryFunctions
11/13 Test #11: TestSymmetryFunctions ............   Passed  120.57 sec
      Start 12: TestNeighbors
12/13 Test #12: TestNeighbors ....................   Passed    9.12 sec
      Start 13: TestGetNeighborPairs
13/13 Test #13: TestGetNeighborPairs .............   Passed    2.79 sec

85% tests passed, 2 tests failed out of 13

Total Test time (real) = 672.64 sec

The following tests FAILED:
	  5 - TestBatchedNN (Failed)
	  6 - TestCFConv (Failed)

The failures are just a couple of floating-point assertion errors that might be stochastic.

This was using a build environment created with NNPOps/environment.yaml, but with cudatoolkit=11.7 to match my installed CUDA.

sef43 (Contributor) commented Feb 24, 2023

I think these lines in SymmetryFunctions.cpp

            if (device.is_cpu()) {
                impl = std::make_shared<CpuANISymmetryFunctions>(numAtoms, numSpecies, Rcr, Rca, false, atomSpecies_, radialFunctions, angularFunctions, true);
                                                                                                 ^^^^^

#ifdef ENABLE_CUDA
            } else if (device.is_cuda()) {
                // PyTorch allow to chose GPU with "torch.device", but it doesn't set as the default one.
                CHECK_CUDA_RESULT(cudaSetDevice(device.index()));
                impl = std::make_shared<CudaANISymmetryFunctions>(numAtoms, numSpecies, Rcr, Rca, false, atomSpecies_, radialFunctions, angularFunctions, true);
                                                                                                 ^^^^^
#endif
            } else

mean that the periodic flag is hard-coded to false, so the templated <PERIODIC> displacement/distance calculations in CpuANISymmetryFunctions.cpp / CudaANISymmetryFunctions.cu are never actually used. Or am I misunderstanding how the .cpp/.cu code is called from Python?

peastman (Member) commented:

I believe you're correct about that. If we replace false with cellPtr != nullptr, that should pass the correct value for whether to use periodic boundary conditions.

RaulPPelaez (Contributor) commented:

Amazing catch, @sef43! @peastman's suggestion fixes the issue. The following test now passes:

from openmm.app import Simulation
from openmm import unit, LangevinIntegrator, Platform
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
import numpy as np

box_edge = 15 * unit.angstrom
testsystem = WaterBox(box_edge, cutoff=7 * unit.angstrom)
potential = MLPotential("ani2x")
platform = Platform.getPlatformByName("CPU")
prop = dict(CudaPrecision="mixed")
forces={}
positions=[]
for s in ("nnpops","torchani"):
    system = potential.createSystem(
        testsystem.topology, implementation=s
    )
    print(f"Implementation {s}")
    file=open(f"{s}.dat", 'w')
    integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 0 * unit.picoseconds)
    simulation=Simulation(testsystem.topology, system, integrator, platform, prop)
    simulation.context.setPositions(testsystem.positions)
    forces[s] = simulation.context.getState(getForces=True).getForces().value_in_unit(unit.kilojoules/unit.mole/unit.nanometer)
    positions = simulation.context.getState(getPositions=True, enforcePeriodicBox=True).getPositions().value_in_unit(unit.nanometer)

fnorms_nnpops=np.linalg.norm(forces["nnpops"],axis=1)
fnorms_torchani=np.linalg.norm(forces["torchani"],axis=1)

error = np.abs((fnorms_nnpops - fnorms_torchani)/fnorms_torchani)
print(f"Maximum error: {np.max(error)}")
print(f"Mean error: {np.mean(error)}")
print(f"Std error: {np.std(error)}")

Printing:

Maximum error: 1.1337166912735413e-05
Mean error: 1.2383139611804328e-06
Std error: 1.5981482745315712e-06

wiederm (Author) commented Mar 2, 2023

This is great, thank you!
