lattice_dyson_g0_wk hangs when inside MPI #48

Open
andrewkhardy opened this issue Oct 1, 2024 · 3 comments
Labels: bug (Something isn't working)

andrewkhardy commented Oct 1, 2024

I'm currently running into a problem where this function sometimes hangs when called many times under MPI.
Inside a bash script, for example, I run the following commands (hoping to keep OpenMP from interfering with an MPI loop):

export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1

mpirun python openMP_test_case.py

where a minimal working example is:

import numpy as np
from triqs_tprf.tight_binding import TBLattice
from triqs_tprf.lattice import *
from triqs_tprf.lattice_utils import *
from triqs.gf import *
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
Variable_1_array = np.linspace(0.0, 1.0, 11)
Variable_2_array = np.linspace(0.0, 1.0, 11)
grid = np.array(np.meshgrid(Variable_1_array, Variable_2_array)).T.reshape(-1, 2)
chunks = np.array_split(grid, comm.Get_size())  # split the parameter grid across the MPI ranks
chunk = comm.scatter(chunks, root=0)

for c in chunk:
	var_1, var_2 = np.round(c, 4)

	t = 0.1 + var_1*var_2
	H = TBLattice(
	    units = [(1, 0, 0), (0, 1, 0)],
	    hopping = {
	        # nearest neighbour hopping -t
	        ( 0,+1): -t * np.eye(2),
	        ( 0,-1): -t * np.eye(2),
	        (+1, 0): -t * np.eye(2),
	        (-1, 0): -t * np.eye(2),
	        },
	    orbital_positions = [(0,0,0)]*2,
	    orbital_names = ['up', 'do'],
	    )
	e_k = H.fourier(H.get_kmesh(n_k=(60, 60, 1)))
	DLRmesh = MeshDLRImFreq(beta=100, statistic='Fermion', w_max=10, eps=1e-14)
	G_c_wk = lattice_dyson_g0_wk(mu=0.0, e_k=e_k, mesh=DLRmesh)
	print("Finished for loop iteration")
print("FINISHED RUNNING ?")

This hangs and never completes. I need to perform operations (like saving data) after this loop, so I need to understand how to prevent the hang.

(This issue also occurs without the DLR mesh.)

By the way, I also get a warning when calling lattice_dyson_g0_wk, which is probably also worth fixing:
DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
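
For what it's worth, that class of warning typically comes from converting a one-element array to a Python scalar somewhere in the call chain. A minimal, TPRF-independent snippet that triggers the same message under NumPy >= 1.25:

import numpy as np

x = np.array([0.5])  # size-1 array with ndim == 1
mu = float(x)        # DeprecationWarning under NumPy >= 1.25; use x.item() or x[0] instead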

andrewkhardy added the bug label on Oct 1, 2024
andrewkhardy changed the title from "Deprication Warning in lattice_dyson_g0_wk" to "lattice_dyson_g0_wk hangs when inside MPI" on Oct 2, 2024
andrewkhardy (Author) commented:

Staring at the C++ source code, it appears that the jankiest fix would be the following _nompi variant, but it would be nice if one did not need to call a separate function. Maybe that is too ambitious.

g_fk_t lattice_dyson_g0_fk_nompi(double mu, e_k_cvt e_k, mesh::refreq mesh, double delta) {

  auto I = nda::eye<ek_vt::scalar_t>(e_k.target_shape()[0]);
  g_fk_t g0_fk({mesh, e_k.mesh()}, e_k.target_shape());
  g0_fk() = 0.0;
  std::complex<double> idelta(0.0, delta);

  auto arr = g0_fk.mesh();
#pragma omp parallel for
  for (int idx = 0; idx < arr.size(); idx++) {
    auto &[f, k] = arr[idx];
    g0_fk[f, k]  = inverse((f + idelta + mu) * I - e_k[k]);
  }

  return g0_fk;
}

g_wk_t lattice_dyson_g0_wk_nompi(double mu, e_k_cvt e_k, mesh::imfreq mesh) {
  return lattice_dyson_g0_Xk_nompi<g_wk_t, mesh::imfreq>(mu, e_k, mesh);
}
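
If such _nompi variants were also exposed through the Python bindings (the name below is hypothetical, not part of the current triqs_tprf API), the parameter loop above could call them directly:

# hypothetical binding, shown only to illustrate the proposed interface
G_c_wk = lattice_dyson_g0_wk_nompi(mu=0.0, e_k=e_k, mesh=DLRmesh)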

HugoStrand (Member) commented:

Dear Andrew,

Thank you for the bug report. I am not sure what is going wrong, but I think it is an interference between the TPRF MPI behaviour and your use of MPI to scatter parameters to the ranks.

Please note that the MPI parallelization in TPRF assumes that all ranks call lattice_dyson_g0_wk; the second assumption is that all ranks pass the same input parameters to the function.

I think both of these requirements are violated by the example. First, the hopping is not the same on all ranks, since they work on different points of the meshgrid. Second, if the number of grid points is not evenly divisible by the number of MPI ranks, a subset of ranks ends up with a chunk of a different length. Since the number of lattice_dyson_g0_wk calls each rank makes equals the length of its chunk, the program hangs: the parallelization assumes that all MPI ranks make the same number of calls to the function.
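
For example, with the 11×11 grid (121 points) split over four ranks, one rank gets an extra point and therefore makes one extra call:

import numpy as np

grid = np.arange(121)              # stand-in for the 121 parameter points
chunks = np.array_split(grid, 4)   # e.g. mpirun -n 4
print([len(c) for c in chunks])    # [31, 30, 30, 30] -> rank 0 makes one extra collective call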

Could you please check whether this simplified example also hangs?

import numpy as np
from triqs_tprf.tight_binding import TBLattice
from triqs_tprf.lattice import *
from triqs_tprf.lattice_utils import *
from triqs.gf import *
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if True:
	t = 0.1
	H = TBLattice(
	    units = [(1, 0, 0), (0, 1, 0)],
	    hopping = {
	        # nearest neighbour hopping -t
	        ( 0,+1): -t * np.eye(2),
	        ( 0,-1): -t * np.eye(2),
	        (+1, 0): -t * np.eye(2),
	        (-1, 0): -t * np.eye(2),
	        },
	    orbital_positions = [(0,0,0)]*2,
	    orbital_names = ['up', 'do'],
	    )
	e_k = H.fourier(H.get_kmesh(n_k=(60, 60, 1)))
	DLRmesh = MeshDLRImFreq(beta=100, statistic='Fermion', w_max=10, eps=1e-14)
	G_c_wk = lattice_dyson_g0_wk(mu=0.0, e_k=e_k, mesh=DLRmesh)
	print("Finished for loop iteration")

print("FINISHED RUNNING ?")

Best, Hugo

andrewkhardy (Author) commented:

Dear Hugo,

Thank you for the quick response. The example you provided does not hang, but it also does not accomplish what I am hoping to achieve. I would like to iterate over points in parameter space (density and hopping, say) and then perform calculations with the lattice Green's function. I need the outer MPI loop for that (not present in your example), and the MPI inside the function seems to interfere with that outer loop.
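
One way to reconcile the outer parameter scan with those requirements, assuming (as stated above) that every rank must call lattice_dyson_g0_wk with identical arguments and the same number of times, is to let every rank loop over the full grid and only keep the results for the points it owns. A sketch, not tested; build_e_k and the round-robin ownership are illustrative choices, not TPRF API:

import numpy as np
from triqs_tprf.tight_binding import TBLattice
from triqs_tprf.lattice import lattice_dyson_g0_wk
from triqs.gf import MeshDLRImFreq
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

Variable_1_array = np.linspace(0.0, 1.0, 11)
Variable_2_array = np.linspace(0.0, 1.0, 11)
grid = np.array(np.meshgrid(Variable_1_array, Variable_2_array)).T.reshape(-1, 2)

def build_e_k(t):
    # same square-lattice model as in the original example
    H = TBLattice(
        units=[(1, 0, 0), (0, 1, 0)],
        hopping={
            (0, +1): -t * np.eye(2),
            (0, -1): -t * np.eye(2),
            (+1, 0): -t * np.eye(2),
            (-1, 0): -t * np.eye(2),
        },
        orbital_positions=[(0, 0, 0)] * 2,
        orbital_names=['up', 'do'],
    )
    return H.fourier(H.get_kmesh(n_k=(60, 60, 1)))

DLRmesh = MeshDLRImFreq(beta=100, statistic='Fermion', w_max=10, eps=1e-14)

results = {}
for i, c in enumerate(grid):
    var_1, var_2 = np.round(c, 4)
    t = 0.1 + var_1 * var_2

    # every rank builds the same e_k and makes the same collective call ...
    e_k = build_e_k(t)
    g0_wk = lattice_dyson_g0_wk(mu=0.0, e_k=e_k, mesh=DLRmesh)

    # ... but only the "owner" of this grid point keeps/post-processes the result
    if i % size == rank:
        results[tuple(c)] = g0_wk

print(f"rank {rank}: kept {len(results)} of {len(grid)} points")

The trade-off is that the parameter points are no longer distributed across the ranks; instead, each individual lattice_dyson_g0_wk call is parallelized internally by TPRF.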
