Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of chunk balancer feature #1775

Merged
merged 10 commits into from
Oct 9, 2021
3 changes: 2 additions & 1 deletion AUTHORS
Original file line number Diff line number Diff line change
Expand Up @@ -14,4 +14,5 @@ Alec Hammond <[email protected]>
Yidong Chong <[email protected]>
Robin Dunn <[email protected]>
Ian Williamson <[email protected]>
Andreas Hoenselaar <[email protected]>
Andreas Hoenselaar <[email protected]>
Ben Bartlett <[email protected]>
133 changes: 133 additions & 0 deletions doc/docs/Parallel_Meep.md
Original file line number Diff line number Diff line change
Expand Up @@ -169,3 +169,136 @@ See also [FAQ/Should I expect linear speedup from the parallel Meep](FAQ.md#shou
<center>
![](images/parallel_benchmark_DFT.png)
</center>


Dynamic Chunk Balancing
-----------------------

Since Meep's computation time is ultimately bottlenecked by the slowest-running process, it is desirable to load-balance the chunk layout such that each compute node is given an equal workload. By default, Meep uses a heuristics-based scheme to estimate the cost of a computation region based on the composition of voxel types (anisotropic, PML, etc.) However, this method of estimating the simulation time for a chunk in advance is not always accurate, a problem which is especially true for simulations run on shared-resource clusters. Meep's `chunk_balancer` module allows for a more empirical, data-driven approach for dynamically load-balancing parallel simulations. This approach uses the simulation time per node to adaptively modify the chunk layout and it implicitly corrects for heterogeneity over the simulation region and variability in background loads and job priority on shared compute resources. This approach can be especially useful for cases such as adjoint optimization, where slight variations on the same simulation are run many times over many iterations.

<center>
![](images/adaptive_chunk_layout.gif)
</center>

### Chunk balancer interface

The chunk balancer interface provides four main methods:
- `make_initial_chunk_layout()` generates the initial chunk layout for the first iteration of a simulation. By default, `None` is returned, indicating Meep should use its default chunk partitioning strategy.
- `should_adjust_chunk_layout()` decides whether the current layout is sufficiently poorly balanced to justify the up-front cost of reallocating the field arrays when changing chunk layouts.
- `compute_new_chunk_layout()` looks at the current chunk layout and per-process timing data to compute a new chunk layout which has chunk sizes adjusted to improve the load balance across compute nodes.
- `adjust_chunk_layout()` is syntactic sugar which will compute a new chunk layout, apply it to the simulation object, reset the simulation, and re-initialize the simulation.

```py
class AbstractChunkBalancer(abc.ABC):
"""Chunk balancer for dynamically load-balanced chunk layouts."""

def make_initial_chunk_layout(self, sim) -> mp.BinaryPartition:
"""Generates an initial chunk layout for simulation."""

def should_adjust_chunk_layout(self, sim) -> bool:
"""Is current layout imbalanced enough to justify rebuilding sim?"""

@abc.abstractmethod
def compute_new_chunk_layout(
self,
timing_measurements: MeepTimingMeasurements,
old_chunk_layout: mp.BinaryPartition,
chunk_volumes: Tuple[mp.grid_volume],
chunk_owners: np.ndarray) -> mp.BinaryPartition:
"""Rebalance chunks to equalize simulation time for each node."""

def adjust_chunk_layout(self, sim, **kwargs) -> None:
"""Computes a new chunk layout and applies it to sim."""
```

### Usage

Using the chunk balancer is very straightforward, and it can typically be integrated into existing Meep simulations with only a few lines of code. Here is a simple example:

```py
from meep.chunk_balancer import ChunkBalancer

chunk_balancer = ChunkBalancer()

# Compute an initial chunk layout
initial_chunk_layout = chunk_balancer.make_initial_chunk_layout()

sim = mp.Simulation(..., chunk_layout=initial_chunk_layout)
sim.init_sim()

for iteration in range(epochs):
sim.run(...)
# Adjust the chunk layout for the next iteration if needed
chunk_balancer.adjust_chunk_layout(sim, sensitivity=0.4)
```
bencbartlett marked this conversation as resolved.
Show resolved Hide resolved

Chunks can also be rebalanced between runs of a program by dumping and loading the chunk layout from a pickled object. For example:

```py
import meep as mp
from meep.chunk_balancer import ChunkBalancer
from meep.timing_measurements import MeepTimingMeasurements
import pickle

# Fetch chunk layout from a previous run
with open("path/to/chunk_layout.pkl", "rb") as f:
chunk_layout = pickle.load(f)
bencbartlett marked this conversation as resolved.
Show resolved Hide resolved

sim = mp.Simulation(..., chunk_layout=chunk_layout)
sim.init_sim()
sim.run(...)

# Compute and save chunk layout for next run
timings = MeepTimingMeasurements.new_from_simulation(sim)
chunk_volumes = sim.structure.get_chunk_volumes()
chunk_owners = sim.structure.get_chunk_owners()
next_chunk_layout = ChunkBalancer().compute_new_chunk_layout(
timings,
chunk_layout,
chunk_volumes,
chunk_owners,
sensitivity=0.4)

# Save chunk layout for next run
with open("path/to/chunk_layout.pkl", "wb") as f:
pickle.dump(next_chunk_layout, f)
```

### Chunk adjustment algorithm

The default chunk adjustment algorithm recursively traverses the `BinaryPartition` object and resizes the chunk volumes of the left and right children of each node to have an equal per-node simulation time. The new chunk layout has split positions which are a weighted average of the previous iteration's chunk layout and the newly computed layout, and the `sensitivity` parameter adjusts how fast the chunk sizes are adjusted. (`sensitivity=0.0` means no adjustment, `sensitivity=1.0` means an immediate snap to the predicted split positions, and `sensitivity=0.5` is the average of the old and new layouts.) The adjustment algorithm is summarized in pseudocode below:

```
def adjust_split_pos(node):
for subtree in {node.left, node.right}:
V := Σ volume for nodes in subtree
t := Σ sim time for nodes in subtree
n := number of processes in subtree
l := t / n # average load per process

Vₗ ↦ Vₗ / lₗ
Vᵣ ↦ Vᵣ / lᵣ

split_pos’ := dₘᵢₙ + (dₘₐₓ - dₘᵢₙ) * (Vₗ) / (Vₗ + Vᵣ)

# Adjust with sensitivity parameter
node.split_pos ↦ s * split_pos’ + (1-s) * node.split_pos

# Recurse through rest of the tree
adjust_split_pos(node.left)
adjust_split_pos(node.right)
```

### Load balancing results on shared clusters

The following benchmarking results show load-balancing improvements from a parallel run on a shared compute cluster in a datacenter. The per-core working times (blue and orange) start of initially poorly balanced using the default `split_by_cost` scheme, but the load balancing improves over successive iterations.

<center>
![](images/chunk_balancer_timing_stats.gif)
</center>

Using the normalized standard deviation in simulation times per iteration as a proxy for load-balancing efficacy, we can see that the load balance improves over a large number of runs with varying numbers of processes:

<center>
![](images/chunk_balancer_variance.jpg)
</center>
Binary file added doc/docs/images/adaptive_chunk_layout.gif
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/docs/images/chunk_balancer_timing_stats.gif
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/docs/images/chunk_balancer_variance.jpg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
6 changes: 6 additions & 0 deletions python/Makefile.am
Original file line number Diff line number Diff line change
Expand Up @@ -43,9 +43,11 @@ TESTS = \
$(TEST_DIR)/test_antenna_radiation.py \
$(TEST_DIR)/test_array_metadata.py \
$(TEST_DIR)/test_bend_flux.py \
$(TEST_DIR)/test_binary_partition_utils.py \
$(BINARY_GRATING_TEST) \
$(TEST_DIR)/test_cavity_arrayslice.py \
$(TEST_DIR)/test_cavity_farfield.py \
$(TEST_DIR)/test_chunk_balancer.py \
$(TEST_DIR)/test_chunk_layout.py \
$(TEST_DIR)/test_chunks.py \
$(TEST_DIR)/test_cyl_ellipsoid.py \
Expand Down Expand Up @@ -85,6 +87,7 @@ TESTS = \
$(TEST_DIR)/test_simulation.py \
$(TEST_DIR)/test_special_kz.py \
$(TEST_DIR)/test_source.py \
$(TEST_DIR)/test_timing_measurements.py \
$(TEST_DIR)/test_user_defined_material.py \
$(TEST_DIR)/test_visualization.py \
$(WVG_SRC_TEST)
Expand Down Expand Up @@ -205,9 +208,12 @@ endif # MAINTAINER_MODE
# specification of python source files to be byte-compiled at installation
######################################################################
HL_IFACE = \
$(srcdir)/binary_partition_utils.py \
$(srcdir)/chunk_balancer.py \
$(srcdir)/geom.py \
$(srcdir)/simulation.py \
$(srcdir)/source.py \
$(srcdir)/timing_measurements.py \
$(srcdir)/visualization.py \
$(srcdir)/materials.py \
$(srcdir)/verbosity_mgr.py
Expand Down
Loading