
Allocating multiple blocks to one mpi rank in LB #5026

Open · wants to merge 27 commits into base: python
Conversation

@hidekb hidekb commented Jan 10, 2025

Description of changes:

  • LB CPU now supports allocating multiple blocks to one MPI rank
    • The default number of blocks per MPI rank is 1
    • The number of blocks per MPI rank is controlled by the argument blocks_per_mpi_rank of LBFluidWalberla
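The decomposition described above can be sketched as follows. This is an illustrative stand-alone snippet, not ESPResSo code: the names `grid_dimensions`, `node_grid`, and `blocks_per_mpi_rank` are taken from the PR, but the helper function is hypothetical.

```python
# Illustrative sketch: the global grid is divided first across MPI ranks
# (node_grid), then each rank's domain is split into blocks_per_mpi_rank
# blocks, element-wise per direction.
def cells_per_block(grid_dimensions, node_grid, blocks_per_mpi_rank):
    """Per-block cell counts from an element-wise division of the grid."""
    cells = []
    for dim, ranks, blocks in zip(grid_dimensions, node_grid,
                                  blocks_per_mpi_rank):
        total_blocks = ranks * blocks
        assert dim % total_blocks == 0, "grid must divide evenly into blocks"
        cells.append(dim // total_blocks)
    return cells

# 64^3 grid, 2x2x2 MPI ranks, 2 blocks per rank along z:
print(cells_per_block([64, 64, 64], [2, 2, 2], [1, 1, 2]))  # [32, 32, 16]
```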

@hidekb hidekb marked this pull request as draft January 10, 2025 18:01
maintainer/benchmarks/lb.py (outdated, resolved)
if (blocks_per_mpi_rank != Utils::Vector3i{{1, 1, 1}}) {
throw std::runtime_error(
"GPU architecture PROHIBITED allocating many blocks to 1 CPU.");
}
Contributor

"Using more than one block per MPI rank is not supported for GPU LB" (but why, actually?)

Member

how about "GPU LB only uses 1 block per MPI rank"?

@@ -96,7 +96,7 @@ class BoundaryPackInfo : public PackInfo<GhostLayerField_T> {
WALBERLA_ASSERT_EQUAL(bSize, buf_size);
#endif

- auto const offset = std::get<0>(m_lattice->get_local_grid_range());
+ auto const offset = to_vector3i(receiver->getAABB().min());
Contributor

Wouldn't it be better to have functions for this in the Lattice class, so they can be used by EK as well?
After all, LB and EK probably need to agree about both the MPI and the block decomposition?

Author

I removed the free function to_vector3i and added the member function LatticeWalberla::get_block_corner.

}

auto constexpr lattice_constant = real_t{1};
auto const cells_block = Utils::hadamard_division(grid_dimensions, node_grid);
Contributor

cells_per_block?

Author

I changed cells_block to cells_per_block.

// number of cells per block in each direction
uint_c(cells_block[0]), uint_c(cells_block[1]), uint_c(cells_block[2]),
lattice_constant,
// number of cpus per direction
uint_c(node_grid[0]), uint_c(node_grid[1]), uint_c(node_grid[2]),
// periodicity
- true, true, true);
+ true, true, true,
+ // keep global block information
Contributor

What does this do/mean?

Author

If "keep global block information" is true, each process keeps information about remote blocks that reside on other processes.

return v * u


LB_PARAMS = {'agrid': 1.,
Contributor

Please avoid 1s in unit tests, as wrong exponents don't get caught.

Author

I moved this function.

@RudolfWeeber
Contributor

Thank you. Looks good in general.
One thing worth looking into is the cell interval business. It looks like there are a lot of very similar loops in the getters and setters for slices. Would it be possible to pull that out into a function, which is then called with different lambdas for the individual cases?
@jngrad could you maybe take a look?

@hidekb
Author

hidekb commented Jan 15, 2025

Thank you. Looks good in general. One thing worth looking into is the cell interval business. It looks like there are a lot of very similar loops in the getters and setters for slices. Would it be possible to pull that out into a function, which is then called with different lambdas for the individual cases?

I rewrote the code related to the getters and setters for slices. Similar loops are pulled into a function which calls different lambdas for the individual cases.
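The refactoring pattern described here can be sketched in a few lines. This is an illustrative stand-alone example with hypothetical names, not the actual C++ code from the PR: a single generic loop over a cell interval is parameterized by a per-cell callback, so getters and setters differ only in the lambda they pass.

```python
# Generic loop over a half-open 3D cell interval [lower, upper),
# applying a caller-supplied operation to each cell.
def for_each_cell(lower, upper, op):
    for x in range(lower[0], upper[0]):
        for y in range(lower[1], upper[1]):
            for z in range(lower[2], upper[2]):
                op(x, y, z)

# A "setter" and a "getter" become different lambdas over the same loop:
values = {}
for_each_cell((0, 0, 0), (2, 2, 1),
              lambda x, y, z: values.__setitem__((x, y, z), 0.0))
collected = []
for_each_cell((0, 0, 0), (2, 2, 1),
              lambda x, y, z: collected.append(values[(x, y, z)]))
print(len(collected))  # 4
```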

maintainer/benchmarks/lb.py (outdated, resolved)

"""
Benchmark Lattice-Boltzmann fluid + Lennard-Jones particles.
"""
Member

Wouldn't it be more sustainable if we only had one LB benchmark file to maintain? argparse is quite flexible, surely we can come up with a way to select strong vs. weak scaling with command line options?

Author

I unified lb.py and lb_weakscaling.py by adding the argparse option --weak_scaling to lb.py.
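A minimal sketch of selecting weak vs. strong scaling via argparse, as discussed above. Only the flag name --weak_scaling comes from the comment; the other option names and the scaling rule are illustrative assumptions, not the benchmark's actual code.

```python
import argparse

# In weak scaling, the box grows with the number of MPI ranks so that
# the workload per rank stays constant; in strong scaling it is fixed.
parser = argparse.ArgumentParser(description="LB benchmark (sketch)")
parser.add_argument("--box_l", type=int, default=32)  # illustrative name
parser.add_argument("--weak_scaling", action="store_true",
                    help="scale the box with the number of MPI ranks")
args = parser.parse_args(["--weak_scaling"])

n_ranks = 8  # stand-in for the actual MPI world size
if args.weak_scaling:
    box_l = args.box_l * round(n_ranks ** (1 / 3))  # cube-root scaling
else:
    box_l = args.box_l
print(box_l)  # 64
```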

src/python/espressomd/lb.py (outdated, resolved)
Comment on lines 142 to 143
auto const blocks_per_mpi_rank = get_value_or<Utils::Vector3i>(
params, "blocks_per_mpi_rank", Utils::Vector3i{{1, 1, 1}});
Member

Here and elsewhere, do we really need a default value, considering we already provide a default value in the python class?

Author

I added a default value to the python class LatticeWalberla. And blocks_per_mpi_rank in LBFluid is now set by get_value<Utils::Vector3i>(m_lattice->get_parameter("blocks_per_mpi_rank")).

if (blocks_per_mpi_rank != Utils::Vector3i{{1, 1, 1}}) {
throw std::runtime_error(
"GPU architecture PROHIBITED allocating many blocks to 1 CPU.");
}
Member

how about "GPU LB only uses 1 block per MPI rank"?

src/walberla_bridge/src/utils/types_conversion.hpp (outdated, resolved)
testsuite/python/lb.py (outdated, resolved)
testsuite/python/lb_couette_xy.py (outdated, resolved)
@@ -41,7 +41,7 @@ class LBMassCommon:

"""Check the lattice-Boltzmann mass conservation."""

- system = espressomd.System(box_l=[3.0, 3.0, 3.0])
+ system = espressomd.System(box_l=[6.0, 6.0, 6.0])
Member

Several mass conservation tests are known to take a lot of time, especially in code coverage builds. The runtime can significantly increase when multiple CI jobs run on the same runner, which is then starved of resources. This change may potentially increase the test runtime by a factor of 8. Can you please confirm the runtime did not significantly change in the clang and coverage CI jobs, compared to the python branch?

Author

I changed the box_l from 6 to 4 to reduce the runtime. For testing with blocks_per_mpi_rank (e.g. [1, 1, 2]), box_l = 4 is needed.
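The box-size constraint mentioned here can be made explicit with a small sketch. The helper name and the assumption of 2 MPI ranks along z are illustrative, not taken from the test suite: with agrid = 1, 2 ranks along z, and 2 blocks per rank, the z extent must split into 4 blocks of at least one cell each, hence box_l = 4.

```python
# Smallest box edge (in length units) that gives every block at least
# min_cells_per_block lattice cells along one direction.
def min_box_length(agrid, n_ranks, blocks_per_rank, min_cells_per_block=1):
    return agrid * n_ranks * blocks_per_rank * min_cells_per_block

# agrid = 1, 2 MPI ranks along z, blocks_per_mpi_rank = [1, 1, 2]:
print(min_box_length(1.0, 2, 2))  # 4.0
```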

testsuite/python/lb_shear.py (outdated, resolved)
@hidekb hidekb marked this pull request as ready for review January 23, 2025 18:49