NSGrid segfaults for large systems and small cutoff #3183

Closed
zemanj opened this issue Mar 21, 2021 · 10 comments · Fixed by #3441

@zemanj
Member

zemanj commented Mar 21, 2021

Expected behavior

FastNS never segfaults, no matter how strange its input may be.

Actual behavior

FastNS segfaults if the box is large and the cutoff small.
The reason is that the number of grid cells is not limited in nsgrid.pyx.

Code to reproduce the behavior

import MDAnalysis as mda
from MDAnalysisTests.datafiles import SURFACE_PDB
from MDAnalysis.lib.nsgrid import FastNS

u = mda.Universe(SURFACE_PDB)
box = u.dimensions
positions = u.atoms.positions

# universe inflation :-)
factor = 350
box[:3] *= factor
positions *= factor

ns = FastNS(cutoff=2.9, coords=positions, box=box, pbc=True)

Output:

Segmentation fault (core dumped)

Current version of MDAnalysis

  • Which version are you using? 1.0.2-dev0
  • Which version of Python? Python 3.8.5
  • Which operating system? Ubuntu 20.04.2 LTS
@zemanj
Member Author

zemanj commented Mar 21, 2021

IMHO we need an intelligent solution for this, not just a simple cap on the number of grid cells.
The maximum number of grid cells could be based on the expected number of atoms per cell (i.e., number density) assuming a homogeneous system (I guess that's the best we can do).
Should the number of cells still be too large (memory issues / overflow of cell ids), one should start decreasing the number of cells in the direction with the most cells first because that has the least relative impact.
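
A minimal sketch of this heuristic (hypothetical names, not MDAnalysis API): start from one cell per cutoff length in each direction, cap the total with a density-based budget, and halve the dimension with the most cells until the grid fits.

import numpy as np

def capped_cell_counts(box_lengths, cutoff, n_atoms,
                       max_total_cells=2**31 - 1,
                       target_atoms_per_cell=8):
    # Ideal grid: one cell per cutoff length in each direction.
    ncells = np.maximum((np.asarray(box_lengths) / cutoff).astype(np.int64), 1)
    # Density-based budget: assuming a homogeneous system, there is no
    # point in having far more cells than atoms.
    budget = max(min(max_total_cells, n_atoms // target_atoms_per_cell), 1)
    while np.prod(ncells) > budget:
        # Coarsen the direction with the most cells first; halving the
        # finest axis costs the least relative resolution.
        i = np.argmax(ncells)
        ncells[i] = max(ncells[i] // 2, 1)
    return ncells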

@orbeckst
Member

orbeckst commented May 6, 2021

@richardjgowers @zemanj is this issue fixed in 1.1.1/2.0.0?

@zemanj
Member Author

zemanj commented May 15, 2021

Nope, the issue still persists (at least on develop).

@IAlibay
Member

IAlibay commented May 15, 2021

Can confirm it still segfaults on 1.1.1 too.

@IAlibay
Member

IAlibay commented May 15, 2021

To add to the conversation here: there's a really interesting behaviour where, if you set the cutoff to either 2 or 3.5, you get the following ValueError instead:

ValueError                                Traceback (most recent call last)
~/software/anaconda/python3.6/2019.7/envs/mda-1.1.1/lib/python3.8/site-packages/numpy/core/numeric.py in full(shape, fill_value, dtype, order, like)
    340         fill_value = asarray(fill_value)
    341         dtype = fill_value.dtype
--> 342     a = empty(shape, dtype, order)
    343     multiarray.copyto(a, fill_value, casting='unsafe')
    344     return a

ValueError: negative dimensions are not allowed
Exception ignored in: 'MDAnalysis.lib.nsgrid.FastNS._pack_grid'
Traceback (most recent call last):
  File "/biggin/b131/bioc1523/software/anaconda/python3.6/2019.7/envs/mda-1.1.1/lib/python3.8/site-packages/numpy/core/numeric.py", line 342, in full
    a = empty(shape, dtype, order)
ValueError: negative dimensions are not allowed
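
For illustration, this ValueError is presumably the same failure surfacing before the segfault: once the overflowed cell count wraps to a negative number, numpy refuses to allocate the array rather than crashing (a minimal demonstration, not the nsgrid code):

import numpy as np

np.full(-5, -1)  # ValueError: negative dimensions are not allowed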

@scal444
Contributor

scal444 commented Oct 14, 2021

The segfault is in the initial _pack_grid call, at

self.next_id[i] = self.head_id[j]

It may be that
j = self.coord2cellid(&self.coords_bbox[i][0])

is giving something bad; the grid sizes seemed weird, though, so there may be an upstream issue. I can dig into this.

@scal444
Contributor

scal444 commented Oct 15, 2021

OK, I believe this is integer overflow. The inflated setup with the given cutoff leads to more cells than a 32-bit integer can represent, and the overflowed arithmetic produces incorrectly small structures like head_id.
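
A back-of-the-envelope check of the wrap-around (the ~1750 cells per dimension is an assumption chosen to match the "over 5 billion elements" figure in the next comment):

import numpy as np

ncells = np.int32(1750)           # assumed cells per dimension
total = ncells * ncells * ncells  # wraps in 32-bit arithmetic (numpy may warn)
print(total)                      # 1064407704 instead of 5359375000
print(1750 ** 3)                  # Python ints are exact: 5359375000 > 2**31 - 1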

@scal444
Contributor

scal444 commented Oct 15, 2021

I started converting most of the ints in the FastNS Cython file to long long, but then stopped. I don't think it's reasonable to run this algorithm with the given input and cutoff dimensions: for a system of 100 particles, we're attempting to build a grid with over 5 billion elements. Additionally, C subroutines like _pbc_ortho have ints at their API boundaries, so we risk truncation anyway.

I suggest a bounds check to make sure the grid dimensions are within the range of int32 arithmetic, with a message suggesting a larger cutoff distance whenever the input would produce out-of-range grid sizes. Further down the road, we could improve the heuristics around brute-force vs. NSGrid algorithm selection.
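
A sketch of the suggested guard (hypothetical helper, not the actual nsgrid code):

INT32_MAX = 2**31 - 1

def check_grid_size(ncells_x, ncells_y, ncells_z, cutoff):
    # Python ints never overflow, so this product is exact.
    total = ncells_x * ncells_y * ncells_z
    if total > INT32_MAX:
        raise ValueError(
            f"{ncells_x} x {ncells_y} x {ncells_z} = {total} grid cells "
            f"exceeds 32-bit indexing; increase the cutoff (currently "
            f"{cutoff}) or use a different search method."
        )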

@richardjgowers
Member

What's the box size and cutoff combination you're using to trigger this?

I think an easy fix for this is just to limit the number of cells in any direction to cube root(INT32_MAX).
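
A sketch of that cap (assumed names, not the code from the eventual fix): since floor(cbrt(INT32_MAX)) = 1290 and 1290**3 = 2,146,689,000 < INT32_MAX, clamping each dimension to 1290 cells keeps the total cell count representable.

import math

MAX_CELLS_PER_DIM = math.floor((2**31 - 1) ** (1 / 3))  # 1290

def clamped_ncells(box_length, cutoff):
    # One cell per cutoff length, but never more than the safe maximum.
    return min(max(int(box_length / cutoff), 1), MAX_CELLS_PER_DIM)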

@scal444
Contributor

scal444 commented Oct 16, 2021

I used the original bug description to reproduce. Your suggested fix is what I had in mind as well.

scal444 pushed a commit to scal444/mdanalysis that referenced this issue Oct 16, 2021
The NSGrid cython code uses integer indexing, which is typically
32 bit. 3D grids with small cutoffs or giant box sizes can have
grids > 1000 per dimension, which leads to integer overflow when
trying to index the grids.

Fixes MDAnalysis#3183
scal444 pushed a commit to scal444/mdanalysis that referenced this issue Oct 26, 2021
The NSGrid cython code uses integer indexing, which is typically
32 bit. 3D grids with small cutoffs or giant box sizes can have
grids > 1000 per dimension, which leads to integer overflow when
trying to index the grids.

Fixes MDAnalysis#3183