Implementation of chunk balancer feature (#1775)
* Implementation of chunk balancer feature

CONTEXT: This feature uses empirical data about the timings of each MPI process to adjust the chunk sizes to achieve optimal load balancing for repeated simulations, such as iterations in an optimization run. Processes which are consistently slow and overworked will have their chunk sizes reduced for the next iteration of an optimization run, and underworked processes will handle larger chunks. This approach has the advantage that it implicitly incorporates variable load between different machines running the MPI processes.

SCOPE:
- Added abstract class for chunk balancer
- Added DefaultChunkBalancer implementation, which adjusts chunk sizes according to per-process working time (ignoring all-all comms) while maintaining previous split directions
- Wrote unit tests for DefaultChunkBalancer to check for improvement in load-balancing, convergence to a load-balanced state, and that split_pos values are adjusted correctly for each iteration.
- Wrote unit tests for the MockSimulation class used by the DefaultChunkBalancerTests
- Added a binary_partition_utils.py library which includes lots of tree traversal algorithms which are useful to have for the chunk balancer
- Wrote unit tests for binary_partition_utils.py

* fixed wrong test filename in Makefile.am

* add new modules to Makefile.am to be included in __init__.py

* bugfix in tests

* more test bugfixes

* avoid merge conflict

* refactored class names, added documentation for adjusting chunk layouts between runs of a program

* update to Parallel_Meep.md for missing chunk layout file

* test bugfix
bencbartlett authored Oct 9, 2021
1 parent cc626e2 commit d12e544
Showing 12 changed files with 1,641 additions and 1 deletion.
3 changes: 2 additions & 1 deletion AUTHORS
@@ -14,4 +14,5 @@ Alec Hammond <[email protected]>
Yidong Chong <[email protected]>
Robin Dunn <[email protected]>
Ian Williamson <[email protected]>
Andreas Hoenselaar <[email protected]>
Andreas Hoenselaar <[email protected]>
Ben Bartlett <[email protected]>
136 changes: 136 additions & 0 deletions doc/docs/Parallel_Meep.md
@@ -169,3 +169,139 @@ See also [FAQ/Should I expect linear speedup from the parallel Meep](FAQ.md#shou
<center>
![](images/parallel_benchmark_DFT.png)
</center>


Dynamic Chunk Balancing
-----------------------

Since Meep's computation time is ultimately bottlenecked by the slowest-running process, it is desirable to load-balance the chunk layout so that each compute node is given an equal workload. By default, Meep uses a heuristics-based scheme to estimate the cost of a computation region based on the composition of voxel types (anisotropic, PML, etc.). However, estimating the simulation time for a chunk in advance is not always accurate, especially for simulations run on shared-resource clusters. Meep's `chunk_balancer` module allows for a more empirical, data-driven approach to dynamically load-balancing parallel simulations. This approach uses the measured simulation time per node to adaptively modify the chunk layout, and it implicitly corrects for heterogeneity over the simulation region and for variability in background load and job priority on shared compute resources. It is especially useful for cases such as adjoint optimization, where slight variations on the same simulation are run many times over many iterations.

<center>
![](images/adaptive_chunk_layout.gif)
</center>

### Chunk balancer interface

The chunk balancer interface provides four main methods:
- `make_initial_chunk_layout()` generates the initial chunk layout for the first iteration of a simulation. By default, `None` is returned, indicating Meep should use its default chunk partitioning strategy.
- `should_adjust_chunk_layout()` decides whether the current layout is sufficiently poorly balanced to justify the up-front cost of reallocating the field arrays when changing chunk layouts.
- `compute_new_chunk_layout()` looks at the current chunk layout and per-process timing data to compute a new chunk layout which has chunk sizes adjusted to improve the load balance across compute nodes.
- `adjust_chunk_layout()` is a convenience method which computes a new chunk layout, applies it to the simulation object, resets the simulation, and re-initializes the simulation.

```py
class AbstractChunkBalancer(abc.ABC):
    """Chunk balancer for dynamically load-balanced chunk layouts."""

    def make_initial_chunk_layout(self, sim) -> mp.BinaryPartition:
        """Generates an initial chunk layout for simulation."""

    def should_adjust_chunk_layout(self, sim) -> bool:
        """Is current layout imbalanced enough to justify rebuilding sim?"""

    @abc.abstractmethod
    def compute_new_chunk_layout(
            self,
            timing_measurements: MeepTimingMeasurements,
            old_chunk_layout: mp.BinaryPartition,
            chunk_volumes: Tuple[mp.grid_volume],
            chunk_owners: np.ndarray) -> mp.BinaryPartition:
        """Rebalance chunks to equalize simulation time for each node."""

    def adjust_chunk_layout(self, sim, **kwargs) -> None:
        """Computes a new chunk layout and applies it to sim."""
```

### Usage

The chunk balancer can typically be integrated into an existing Meep simulation with only a few lines of code. Here is a simple example:

```py
import meep as mp
from meep.chunk_balancer import ChunkBalancer

chunk_balancer = ChunkBalancer()

# Compute an initial chunk layout
initial_chunk_layout = chunk_balancer.make_initial_chunk_layout()

sim = mp.Simulation(..., chunk_layout=initial_chunk_layout)
sim.init_sim()

# `epochs` is the number of iterations, defined by the caller
for iteration in range(epochs):
    sim.run(...)
    # Adjust the chunk layout for the next iteration if needed
    chunk_balancer.adjust_chunk_layout(sim, sensitivity=0.4)
```

Chunks can also be rebalanced between runs of a program by dumping and loading the chunk layout from a pickled object. For example:

```py
import os.path
import pickle

import meep as mp
from meep.chunk_balancer import ChunkBalancer
from meep.timing_measurements import MeepTimingMeasurements

# Fetch chunk layout from a previous run if it exists
if os.path.exists("path/to/chunk_layout.pkl"):
    with open("path/to/chunk_layout.pkl", "rb") as f:
        chunk_layout = pickle.load(f)
else:
    chunk_layout = None

sim = mp.Simulation(..., chunk_layout=chunk_layout)
sim.init_sim()
sim.run(...)

# Compute the chunk layout for the next run from this run's timings
timings = MeepTimingMeasurements.new_from_simulation(sim)
chunk_volumes = sim.structure.get_chunk_volumes()
chunk_owners = sim.structure.get_chunk_owners()
next_chunk_layout = ChunkBalancer().compute_new_chunk_layout(
    timings,
    chunk_layout,
    chunk_volumes,
    chunk_owners,
    sensitivity=0.4)

# Save chunk layout for next run
with open("path/to/chunk_layout.pkl", "wb") as f:
    pickle.dump(next_chunk_layout, f)
```

### Chunk adjustment algorithm

The default chunk adjustment algorithm recursively traverses the `BinaryPartition` object and resizes the chunk volumes of the left and right children of each node so that the two subtrees have an equal average simulation time per process. The new chunk layout has split positions which are a weighted average of the previous iteration's chunk layout and the newly computed layout, and the `sensitivity` parameter controls how quickly the chunk sizes are adjusted. (`sensitivity=0.0` means no adjustment, `sensitivity=1.0` means an immediate snap to the predicted split positions, and `sensitivity=0.5` is the average of the old and new layouts.) The adjustment algorithm is summarized in pseudocode below:

```
def adjust_split_pos(node):
    for subtree in {node.left, node.right}:
        V := Σ volume for nodes in subtree
        t := Σ sim time for nodes in subtree
        n := number of processes in subtree
        l := t / n  # average load per process
    Vₗ ↦ Vₗ / lₗ
    Vᵣ ↦ Vᵣ / lᵣ
    split_pos’ := dₘᵢₙ + (dₘₐₓ - dₘᵢₙ) * (Vₗ) / (Vₗ + Vᵣ)
    # Adjust with sensitivity parameter
    node.split_pos ↦ s * split_pos’ + (1-s) * node.split_pos
    # Recurse through rest of the tree
    adjust_split_pos(node.left)
    adjust_split_pos(node.right)
```
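
The update for a single tree node can be sketched in plain Python as follows. This is only an illustration of the arithmetic in the pseudocode, not the code in `chunk_balancer.py`: the function name `adjusted_split_pos` and all of its parameters are hypothetical, and the volumes, timings, and process counts are assumed to have already been summed over each subtree.

```py
def adjusted_split_pos(d_min, d_max, old_split_pos,
                       vol_left, vol_right,
                       time_left, time_right,
                       n_left=1, n_right=1,
                       sensitivity=0.5):
    """Return the new split position for one node (hypothetical helper)."""
    # Average load per process in each subtree
    load_left = time_left / n_left
    load_right = time_right / n_right

    # Scale each volume inversely to its load: slow subtrees shrink
    scaled_left = vol_left / load_left
    scaled_right = vol_right / load_right

    # Predicted split position that would equalize the per-process load
    predicted = d_min + (d_max - d_min) * scaled_left / (scaled_left + scaled_right)

    # Blend the predicted and previous positions with the sensitivity
    return sensitivity * predicted + (1.0 - sensitivity) * old_split_pos


# Two equal-volume chunks on [0, 10], where the left chunk took 3x as long.
# The predicted split is about 2.5, so sensitivity=0.4 moves the split
# from 5.0 to approximately 4.0.
print(adjusted_split_pos(0.0, 10.0, 5.0, 5.0, 5.0, 3.0, 1.0, sensitivity=0.4))
```

In this example the slower left chunk shrinks and the faster right chunk grows. Applying the same update to each internal node while recursing through the tree corresponds to the full traversal described above.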

### Load balancing results on shared clusters

The following benchmarking results show load-balancing improvements from a parallel run on a shared compute cluster in a datacenter. The per-core working times (blue and orange) start off poorly balanced under the default `split_by_cost` scheme, but the load balance improves over successive iterations.

<center>
![](images/chunk_balancer_timing_stats.gif)
</center>

Using the normalized standard deviation in simulation times per iteration as a proxy for load-balancing efficacy, we can see that the load balance improves over a large number of runs with varying numbers of processes:

<center>
![](images/chunk_balancer_variance.jpg)
</center>
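
A metric of this kind can be computed from a list of per-process working times as sketched below. The sketch assumes that "normalized" means dividing the standard deviation by the mean (the coefficient of variation); the exact normalization used for the plot is not stated here, and the function name `normalized_std` is hypothetical.

```py
import statistics


def normalized_std(times):
    """Standard deviation of per-process times divided by their mean.

    Assumption: population standard deviation normalized by the mean.
    A value of 0.0 means every process took the same time.
    """
    mean = statistics.fmean(times)
    if mean == 0:
        raise ValueError("mean working time must be nonzero")
    return statistics.pstdev(times) / mean


# Perfectly balanced vs. imbalanced timings (made-up numbers for illustration)
balanced = normalized_std([2.0, 2.0, 2.0, 2.0])
imbalanced = normalized_std([1.0, 3.0])
print(balanced, imbalanced)
```

Lower values indicate a better load balance, so tracking this number across iterations gives a simple check on whether the chunk balancer is converging.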
Binary file added doc/docs/images/adaptive_chunk_layout.gif
Binary file added doc/docs/images/chunk_balancer_timing_stats.gif
Binary file added doc/docs/images/chunk_balancer_variance.jpg
6 changes: 6 additions & 0 deletions python/Makefile.am
@@ -43,9 +43,11 @@ TESTS = \
$(TEST_DIR)/test_antenna_radiation.py \
$(TEST_DIR)/test_array_metadata.py \
$(TEST_DIR)/test_bend_flux.py \
$(TEST_DIR)/test_binary_partition_utils.py \
$(BINARY_GRATING_TEST) \
$(TEST_DIR)/test_cavity_arrayslice.py \
$(TEST_DIR)/test_cavity_farfield.py \
$(TEST_DIR)/test_chunk_balancer.py \
$(TEST_DIR)/test_chunk_layout.py \
$(TEST_DIR)/test_chunks.py \
$(TEST_DIR)/test_cyl_ellipsoid.py \
@@ -85,6 +87,7 @@ TESTS = \
$(TEST_DIR)/test_simulation.py \
$(TEST_DIR)/test_special_kz.py \
$(TEST_DIR)/test_source.py \
$(TEST_DIR)/test_timing_measurements.py \
$(TEST_DIR)/test_user_defined_material.py \
$(TEST_DIR)/test_visualization.py \
$(WVG_SRC_TEST)
@@ -209,9 +212,12 @@ endif # MAINTAINER_MODE
# specification of python source files to be byte-compiled at installation
######################################################################
HL_IFACE = \
$(srcdir)/binary_partition_utils.py \
$(srcdir)/chunk_balancer.py \
$(srcdir)/geom.py \
$(srcdir)/simulation.py \
$(srcdir)/source.py \
$(srcdir)/timing_measurements.py \
$(srcdir)/visualization.py \
$(srcdir)/materials.py \
$(srcdir)/verbosity_mgr.py