lattice_dyson_g0_wk hangs when inside MPI #48

Open
andrewkhardy opened this issue Oct 1, 2024 · 3 comments
Labels: bug (Something isn't working)

andrewkhardy commented Oct 1, 2024

I'm currently running into a problem where this function sometimes hangs when called many times under MPI.
Inside a bash script, for example, I run the following commands (hoping to keep OpenMP from interfering with an MPI loop):

export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1

mpirun python openMP_test_case.py

where a minimal working example is:

import numpy as np
from triqs_tprf.tight_binding import TBLattice
from triqs_tprf.lattice import *
from triqs_tprf.lattice_utils import *
from triqs.gf import *
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
Variable_1_array = np.linspace(0.0, 1.0, 11)
Variable_2_array = np.linspace(0.0, 1.0, 11)
grid = np.array(np.meshgrid(Variable_1_array, Variable_2_array)).T.reshape(-1, 2)
chunks = np.array_split(grid, comm.Get_size())  # split the parameter grid across the MPI ranks
chunk = comm.scatter(chunks, root=0)

for c in chunk:
	var_1, var_2 = np.round(c, 4)

	t = 0.1 + var_1*var_2
	H = TBLattice(
	    units = [(1, 0, 0), (0, 1, 0)],
	    hopping = {
	        # nearest neighbour hopping -t
	        ( 0,+1): -t * np.eye(2),
	        ( 0,-1): -t * np.eye(2),
	        (+1, 0): -t * np.eye(2),
	        (-1, 0): -t * np.eye(2),
	        },
	    orbital_positions = [(0,0,0)]*2,
	    orbital_names = ['up', 'do'],
	    )
	e_k = H.fourier(H.get_kmesh(n_k=(60, 60, 1)))
	DLRmesh = MeshDLRImFreq(beta=100, statistic='Fermion', w_max=10, eps=1e-14)
	G_c_wk = lattice_dyson_g0_wk(mu=0.0, e_k=e_k, mesh=DLRmesh)
	print("Finished for loop iteration")
print("FINISHED RUNNING ?")

This hangs and never completes. I need to perform operations (like saving data) after this loop, so I need to understand how to prevent the hang.

(This issue also occurs without the DLR mesh.)

By the way, I also get a warning when calling lattice_dyson_g0_wk, which is probably also worth fixing:
DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
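
For what it's worth, that class of warning typically comes from converting a one-element array to a Python scalar somewhere in the call chain. A minimal, TPRF-independent snippet that triggers the same message under NumPy >= 1.25:

import numpy as np

x = np.array([0.5])  # size-1 array with ndim == 1
mu = float(x)        # DeprecationWarning under NumPy >= 1.25; use x.item() or x[0] instead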

andrewkhardy added the bug label on Oct 1, 2024
andrewkhardy changed the title from "Deprication Warning in lattice_dyson_g0_wk" to "lattice_dyson_g0_wk hangs when inside MPI" on Oct 2, 2024
andrewkhardy (Author) commented:

Staring at the C++ source code, it appears that the jankiest fix would be the following _nompi variant, but it would be nice if one did not need to call a separate function. Maybe that is too ambitious.

g_fk_t lattice_dyson_g0_fk_nompi(double mu, e_k_cvt e_k, mesh::refreq mesh, double delta) {

  auto I = nda::eye<ek_vt::scalar_t>(e_k.target_shape()[0]);
  g_fk_t g0_fk({mesh, e_k.mesh()}, e_k.target_shape());
  g0_fk() = 0.0;
  std::complex<double> idelta(0.0, delta);

  auto arr = g0_fk.mesh();
#pragma omp parallel for
  for (int idx = 0; idx < arr.size(); idx++) {
    auto &[f, k] = arr[idx];
    g0_fk[f, k]  = inverse((f + idelta + mu) * I - e_k[k]);
  }

  return g0_fk;
}

g_wk_t lattice_dyson_g0_wk_nompi(double mu, e_k_cvt e_k, mesh::imfreq mesh) {
  return lattice_dyson_g0_Xk_nompi<g_wk_t, mesh::imfreq>(mu, e_k, mesh);
}
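
If such _nompi variants were also exposed through the Python bindings (the name below is hypothetical, not part of the current triqs_tprf API), the parameter loop above could call them directly:

# hypothetical binding, shown only to illustrate the proposed interface
G_c_wk = lattice_dyson_g0_wk_nompi(mu=0.0, e_k=e_k, mesh=DLRmesh)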

HugoStrand (Member) commented:

Dear Andrew,

Thank you for the bug report. I am not sure what is going wrong, but I think it is an interference between the TPRF MPI behaviour and your use of MPI to scatter parameters to the ranks.

Please note that the MPI parallelization in TPRF assumes that all ranks call lattice_dyson_g0_wk; the second assumption is that all ranks pass the same input parameters to the function.

I think both of these requirements are violated by the example. First, the hopping is not the same on all ranks, since they work on different points of the meshgrid. Second, if the number of grid points is not evenly divisible by the number of MPI ranks, a subset of ranks ends up with a chunk of a different length. Since the number of lattice_dyson_g0_wk calls each rank makes equals the length of its chunk, the program hangs: the parallelization assumes that all MPI ranks make the same number of calls to the function.
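
For example, with the 11×11 grid (121 points) split over four ranks, one rank gets an extra point and therefore makes one extra call:

import numpy as np

grid = np.arange(121)              # stand-in for the 121 parameter points
chunks = np.array_split(grid, 4)   # e.g. mpirun -n 4
print([len(c) for c in chunks])    # [31, 30, 30, 30] -> rank 0 makes one extra collective call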

Could you please check whether this simplified example also hangs?

import numpy as np
from triqs_tprf.tight_binding import TBLattice
from triqs_tprf.lattice import *
from triqs_tprf.lattice_utils import *
from triqs.gf import *
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if True:
	t = 0.1
	H = TBLattice(
	    units = [(1, 0, 0), (0, 1, 0)],
	    hopping = {
	        # nearest neighbour hopping -t
	        ( 0,+1): -t * np.eye(2),
	        ( 0,-1): -t * np.eye(2),
	        (+1, 0): -t * np.eye(2),
	        (-1, 0): -t * np.eye(2),
	        },
	    orbital_positions = [(0,0,0)]*2,
	    orbital_names = ['up', 'do'],
	    )
	e_k = H.fourier(H.get_kmesh(n_k=(60, 60, 1)))
	DLRmesh = MeshDLRImFreq(beta=100, statistic='Fermion', w_max=10, eps=1e-14)
	G_c_wk = lattice_dyson_g0_wk(mu=0.0, e_k=e_k, mesh=DLRmesh)
	print("Finished for loop iteration")

print("FINISHED RUNNING ?")

Best, Hugo

andrewkhardy (Author) commented:

Dear Hugo,

Thank you for the quick response. The example you provided does not hang, but it also does not accomplish what I am hoping to achieve. I would like to iterate over points in parameter space (density and hopping, say) and then perform calculations with the lattice Green's function. I need the outer MPI loop for that (not present in your example), and the MPI inside the function seems to interfere with that outer loop.
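
One way to reconcile the outer parameter scan with those requirements, assuming (as stated above) that every rank must call lattice_dyson_g0_wk with identical arguments and the same number of times, is to let every rank loop over the full grid and only keep the results for the points it owns. A sketch, not tested; build_e_k and the round-robin ownership are illustrative choices, not TPRF API:

import numpy as np
from triqs_tprf.tight_binding import TBLattice
from triqs_tprf.lattice import lattice_dyson_g0_wk
from triqs.gf import MeshDLRImFreq
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

Variable_1_array = np.linspace(0.0, 1.0, 11)
Variable_2_array = np.linspace(0.0, 1.0, 11)
grid = np.array(np.meshgrid(Variable_1_array, Variable_2_array)).T.reshape(-1, 2)

def build_e_k(t):
    # same square-lattice model as in the original example
    H = TBLattice(
        units=[(1, 0, 0), (0, 1, 0)],
        hopping={
            (0, +1): -t * np.eye(2),
            (0, -1): -t * np.eye(2),
            (+1, 0): -t * np.eye(2),
            (-1, 0): -t * np.eye(2),
        },
        orbital_positions=[(0, 0, 0)] * 2,
        orbital_names=['up', 'do'],
    )
    return H.fourier(H.get_kmesh(n_k=(60, 60, 1)))

DLRmesh = MeshDLRImFreq(beta=100, statistic='Fermion', w_max=10, eps=1e-14)

results = {}
for i, c in enumerate(grid):
    var_1, var_2 = np.round(c, 4)
    t = 0.1 + var_1 * var_2

    # every rank builds the same e_k and makes the same collective call ...
    e_k = build_e_k(t)
    g0_wk = lattice_dyson_g0_wk(mu=0.0, e_k=e_k, mesh=DLRmesh)

    # ... but only the "owner" of this grid point keeps/post-processes the result
    if i % size == rank:
        results[tuple(c)] = g0_wk

print(f"rank {rank}: kept {len(results)} of {len(grid)} points")

The trade-off is that the parameter points are no longer distributed across the ranks; instead, each individual lattice_dyson_g0_wk call is parallelized internally by TPRF.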
