#2201: tools: NOT to merge: for now, add the user-defined problem to …
…repo to make it easy to run
lifflander committed Nov 29, 2023
1 parent 0ae6532 commit ea410f5
Showing 5 changed files with 389 additions and 0 deletions.
101 changes: 101 additions & 0 deletions tools/user-defined-memory-toy-problem/README
@@ -0,0 +1,101 @@
These files describe a toy problem for testing whether a memory-aware load
balancer is achieving a sensible solution.

The 3D vt index in these files is:
(rank_index, decomp_index_on_rank, task_index_on_decomp).

Each task appears in the JSON files on its home rank (rank_index) where
communication costs will be zero, so no communication edges were included.
However, see the final paragraph for details about communication patterns that
will emerge when the tasks are migrated off the home rank.

The "user-defined" section of the JSON data contains the following fields:
- "task_serialized_bytes": This is the serialized size of the task, which can be
used for modeling the migration cost of the task. It should not be included
when computing the memory usage on a rank.
- "shared_id": This uniquely identifies a block of data on which multiple tasks
will operate. While not important, the shared_id was computed using:
shared_id = decomp_index_on_rank * num_ranks + rank_index
- "shared_bytes": This is the size of the block of data being operated on by the
relevant set of tasks. This memory cost will be incurred exactly once on each
MPI rank on which a task with this shared_id exists.
- "task_footprint_bytes": This is the footprinted size of the task in its
non-running state. We will incur this memory cost once for each individual
task, even if there are other tasks on this rank with the same shared_id. This
can be greater than task_serialized_bytes when the task has data members that
have greater capacity than is being used at serialization time.
- "task_working_bytes": This is the high water mark of the additional working
memory required by the individual task, such as temporary memory needed for
intermediate computation. This value does not include memory shared with other
tasks (i.e., shared_bytes), nor does it include the task_footprint_bytes or
task_serialized_bytes. This cost is incurred for each individual task, but
only one at a time because tasks will not run concurrently.
- "rank_working_bytes": This is the amount of memory that the particular rank
needs while processing tasks. This may include global data, constants, and
completely unrelated data pre-allocated by the application. It is assumed to
be constant over time but may vary from rank to rank. This value does not
include shared_bytes, task_working_bytes, task_footprint_bytes, or
task_serialized_bytes.
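
As a quick illustration of the shared_id formula above, here is a minimal
sketch. The function name shared_id_for is hypothetical and not part of vt; the
value num_ranks = 4 is taken from the four ranks of this toy problem.

```python
def shared_id_for(index, num_ranks):
    """Return the shared_id for a 3D vt index.

    index is (rank_index, decomp_index_on_rank, task_index_on_decomp).
    The task index does not affect the result, because every task on the
    same decomposition block shares the same data.
    """
    rank_index, decomp_index_on_rank, _task_index = index
    return decomp_index_on_rank * num_ranks + rank_index


# Indices taken from the JSON files in this commit.
print(shared_id_for((0, 1, 4), num_ranks=4))  # 4, as in toy_mem.0.json
print(shared_id_for((3, 2, 2), num_ranks=4))  # 11, as in toy_mem.3.json
```

Under this formula, shared_id % num_ranks recovers the home rank of a block.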

The maximum memory usage for determining if task placement is feasible will be:
max_memory_usage = rank_working_bytes + shared_level_memory + max_task_level_memory

Computing shared_level_memory: Let S be the set of unique shared_id values on
the rank being considered. Then shared_level_memory is simply the sum of
shared_bytes values for each shared_id in S.

Computing max_task_level_memory: Let T be the set of all tasks on a rank,
regardless of the shared_id on which they operate. Then max_task_level_memory
is the sum of task_footprint_bytes values for each task in T plus the maximum
over the task_working_bytes values for each task in T.
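
The memory computation above can be sketched as follows. This is only an
illustration under stated assumptions: the function name max_memory_usage is
hypothetical, each task is represented by its "user_defined" dictionary from
the JSON files, and rank_working_bytes is passed separately because it belongs
to the rank rather than to the tasks placed on it.

```python
def max_memory_usage(rank_working_bytes, tasks):
    """Estimate the peak memory on one rank for a given set of tasks.

    tasks is a list of "user_defined" dictionaries, one per task on the rank.
    """
    # shared_level_memory: each unique shared_id is paid for exactly once.
    shared = {t["shared_id"]: t["shared_bytes"] for t in tasks}
    shared_level_memory = sum(shared.values())

    # max_task_level_memory: every footprint is resident at the same time,
    # but only one task's working memory is live at any moment.
    footprint = sum(t["task_footprint_bytes"] for t in tasks)
    max_working = max((t["task_working_bytes"] for t in tasks), default=0.0)
    max_task_level_memory = footprint + max_working

    return rank_working_bytes + shared_level_memory + max_task_level_memory


def toy_task(shared_id):
    """Build a task record with the byte sizes used throughout this toy problem."""
    return {
        "shared_id": shared_id,
        "shared_bytes": 1600000000.0,
        "task_footprint_bytes": 1024.0,
        "task_working_bytes": 110000000.0,
    }


# The five tasks that start on rank 1 (shared_ids from toy_mem.1.json).
rank1_tasks = [toy_task(s) for s in (5, 1, 9, 1, 5)]
print(max_memory_usage(980000000.0, rank1_tasks))
```

For rank 1 in its initial state this is 980000000 + 3 * 1600000000 + 5 * 1024 +
110000000 bytes, since three unique shared_id values are present.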

Any communication-aware load balancer should also consider the communication
implied by this memory data. The task_serialized_bytes is the serialized size
of the task, so migrating it will require a communication of at least that size
from the home rank to the target rank. For applications where the shared memory
corresponding to a shared_id is writeable, at least shared_bytes per unique shared_id
on a target rank will need to be communicated from the target rank back to the
home rank after the relevant tasks complete.
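
One possible reading of the communication described above is sketched below. It
is a lower-bound estimate only, not the cost model of any particular load
balancer. The function name implied_comm_bytes and the shared_writeable flag
are hypothetical; each task is a dictionary carrying "home" (from the "entity"
section) together with the relevant "user_defined" fields.

```python
def implied_comm_bytes(rank, tasks, shared_writeable=True):
    """Return (bytes_to_rank, bytes_back_home) implied by placing tasks on rank.

    bytes_to_rank: serialized size of every task that is off its home rank.
    bytes_back_home: shared_bytes once per unique shared_id among those tasks,
    counted only when the shared block is writeable.
    """
    off_home = [t for t in tasks if t["home"] != rank]
    bytes_to_rank = sum(t["task_serialized_bytes"] for t in off_home)

    bytes_back_home = 0.0
    if shared_writeable:
        blocks = {t["shared_id"]: t["shared_bytes"] for t in off_home}
        bytes_back_home = sum(blocks.values())
    return bytes_to_rank, bytes_back_home


def toy_task(home, shared_id):
    """Build a task record with the byte sizes used in this toy problem."""
    return {
        "home": home,
        "shared_id": shared_id,
        "shared_bytes": 1600000000.0,
        "task_serialized_bytes": 1024.0,
    }


# Rank 0 keeping one of its own tasks and receiving two tasks from rank 1
# (shared_id 1) and two tasks from rank 2 (shared_id 10).
tasks_on_rank0 = [toy_task(0, 4), toy_task(1, 1), toy_task(1, 1),
                  toy_task(2, 10), toy_task(2, 10)]
print(implied_comm_bytes(0, tasks_on_rank0))
```

Here the outbound cost is four serialized tasks (4096 bytes), while the return
cost is two full shared blocks, which is why the return path dominates.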

***Spoilers***

Each of four ranks has three shared blocks. The memory constraints dictate that
at most four unique shared_id values can coexist on each rank. Under these
memory constraints, it is possible to perfectly balance the load (time). There
is more than one way to do so. The communication cost to migrate a task off-rank
is extremely low, but the cost to communicate back the result should be
significant enough to discourage migrating shared_ids to other ranks unless
doing so results in a better-balanced load.

One of the ranks has exactly the rank-averaged load, so it is best if the tasks
on that rank are left in place. Another rank has more than twice the
rank-averaged load. The sum of the loads for the task corresponding to one of
its shared_id values is more than the rank-averaged load, so the tasks for that
shared_id will need to be split across two ranks to achieve good balance. The
tasks for the other shared_ids across all ranks do not need to be split across
multiple ranks to perfectly balance the load (time).

Below is one solution with a perfectly balanced load and decent communication.
I have not evaluated whether it is optimal.

Rank 0:
[0,1,1],[0,1,3],[0,1,4] (part of block home)
[1,0,0],[1,0,1] (whole block not home)
[2,2,0],[2,2,1] (whole block not home)

Rank 1:
[1,1,0],[1,1,1] (home)
[1,2,0] (home)
[0,0,0],[0,0,1],[0,0,2] (whole block not home)
[0,1,0],[0,1,2] (part of block not home)

Rank 2:
[2,0,0],[2,0,1],[2,0,2] (home)
[2,1,0],[2,1,1] (home)
[0,2,0],[0,2,1],[0,2,2] (whole block not home)

Rank 3:
[3,0,0],[3,0,1],[3,0,2] (home)
[3,1,0],[3,1,1],[3,1,2] (home)
[3,2,0],[3,2,1],[3,2,2] (home)
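
A proposed assignment such as the one above can be checked with a short script.
The sketch below is an assumption-laden illustration: the names TIMES,
ASSIGNMENT, and summarize are hypothetical, the task times are copied by hand
from the toy_mem.*.json files in this commit, and the shared_id is derived from
the formula given earlier rather than read from the files.

```python
NUM_RANKS = 4

# Task times keyed by 3D vt index, copied from toy_mem.0.json .. toy_mem.3.json.
TIMES = {
    (0, 0, 0): 10.0, (0, 0, 1): 15.0, (0, 0, 2): 10.0,
    (0, 1, 0): 20.0, (0, 1, 1): 30.0, (0, 1, 2): 25.0,
    (0, 1, 3): 35.0, (0, 1, 4): 10.0,
    (0, 2, 0): 20.0, (0, 2, 1): 5.0, (0, 2, 2): 10.0,
    (1, 0, 0): 5.0, (1, 0, 1): 5.0, (1, 1, 0): 2.5, (1, 1, 1): 2.5,
    (1, 2, 0): 5.0,
    (2, 0, 0): 5.0, (2, 0, 1): 10.0, (2, 0, 2): 15.0,
    (2, 1, 0): 10.0, (2, 1, 1): 5.0, (2, 2, 0): 2.5, (2, 2, 1): 2.5,
    (3, 0, 0): 10.0, (3, 0, 1): 5.0, (3, 0, 2): 5.0,
    (3, 1, 0): 10.0, (3, 1, 1): 20.0, (3, 1, 2): 15.0,
    (3, 2, 0): 5.0, (3, 2, 1): 10.0, (3, 2, 2): 10.0,
}

# The assignment listed above: target rank -> task indices placed on it.
ASSIGNMENT = {
    0: [(0, 1, 1), (0, 1, 3), (0, 1, 4), (1, 0, 0), (1, 0, 1),
        (2, 2, 0), (2, 2, 1)],
    1: [(1, 1, 0), (1, 1, 1), (1, 2, 0), (0, 0, 0), (0, 0, 1), (0, 0, 2),
        (0, 1, 0), (0, 1, 2)],
    2: [(2, 0, 0), (2, 0, 1), (2, 0, 2), (2, 1, 0), (2, 1, 1),
        (0, 2, 0), (0, 2, 1), (0, 2, 2)],
    3: [(3, 0, 0), (3, 0, 1), (3, 0, 2), (3, 1, 0), (3, 1, 1), (3, 1, 2),
        (3, 2, 0), (3, 2, 1), (3, 2, 2)],
}


def summarize(assignment, times, num_ranks):
    """Return {rank: (load, number of unique shared_ids)} for an assignment."""
    placed = [idx for indices in assignment.values() for idx in indices]
    # Every task must be placed exactly once.
    assert sorted(placed) == sorted(times), "assignment does not cover all tasks"

    summary = {}
    for rank, indices in assignment.items():
        load = sum(times[idx] for idx in indices)
        # shared_id = decomp_index_on_rank * num_ranks + rank_index
        shared_ids = {idx[1] * num_ranks + idx[0] for idx in indices}
        summary[rank] = (load, len(shared_ids))
    return summary


average_load = sum(TIMES.values()) / NUM_RANKS
for rank, (load, blocks) in sorted(summarize(ASSIGNMENT, TIMES, NUM_RANKS).items()):
    print(f"rank {rank}: load={load} unique_shared_ids={blocks} "
          f"(rank-averaged load={average_load})")
```

Comparing each reported load with the rank-averaged load shows how close an
assignment is to perfect balance, and the count of unique shared_ids can be
compared with the limit of four per rank.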

285 changes: 285 additions & 0 deletions tools/user-defined-memory-toy-problem/toy_mem.0.json
@@ -0,0 +1,285 @@
{
"type": "LBDatafile",
"phases": [
{
"id": 0,
"tasks": [
{
"entity": {
"collection_id": 7,
"home": 0,
"id": 2883587,
"index": [
0,
1,
4
],
"migratable": true,
"type": "object"
},
"node": 0,
"resource": "cpu",
"time": 10.0,
"user_defined": {
"rank_working_bytes": 980000000.0,
"shared_bytes": 1600000000.0,
"shared_id": 4,
"task_footprint_bytes": 1024.0,
"task_serialized_bytes": 1024.0,
"task_working_bytes": 110000000.0
}
},
{
"entity": {
"collection_id": 7,
"home": 0,
"id": 2621443,
"index": [
0,
1,
3
],
"migratable": true,
"type": "object"
},
"node": 0,
"resource": "cpu",
"time": 35.0,
"user_defined": {
"rank_working_bytes": 980000000.0,
"shared_bytes": 1600000000.0,
"shared_id": 4,
"task_footprint_bytes": 1024.0,
"task_serialized_bytes": 1024.0,
"task_working_bytes": 110000000.0
}
},
{
"entity": {
"collection_id": 7,
"home": 0,
"id": 2359299,
"index": [
0,
2,
2
],
"migratable": true,
"type": "object"
},
"node": 0,
"resource": "cpu",
"time": 10.0,
"user_defined": {
"rank_working_bytes": 980000000.0,
"shared_bytes": 1600000000.0,
"shared_id": 8,
"task_footprint_bytes": 1024.0,
"task_serialized_bytes": 1024.0,
"task_working_bytes": 110000000.0
}
},
{
"entity": {
"collection_id": 7,
"home": 0,
"id": 2097155,
"index": [
0,
1,
2
],
"migratable": true,
"type": "object"
},
"node": 0,
"resource": "cpu",
"time": 25.0,
"user_defined": {
"rank_working_bytes": 980000000.0,
"shared_bytes": 1600000000.0,
"shared_id": 4,
"task_footprint_bytes": 1024.0,
"task_serialized_bytes": 1024.0,
"task_working_bytes": 110000000.0
}
},
{
"entity": {
"collection_id": 7,
"home": 0,
"id": 1835011,
"index": [
0,
0,
2
],
"migratable": true,
"type": "object"
},
"node": 0,
"resource": "cpu",
"time": 10.0,
"user_defined": {
"rank_working_bytes": 980000000.0,
"shared_bytes": 1600000000.0,
"shared_id": 0,
"task_footprint_bytes": 1024.0,
"task_serialized_bytes": 1024.0,
"task_working_bytes": 110000000.0
}
},
{
"entity": {
"collection_id": 7,
"home": 0,
"id": 524291,
"index": [
0,
1,
0
],
"migratable": true,
"type": "object"
},
"node": 0,
"resource": "cpu",
"time": 20.0,
"user_defined": {
"rank_working_bytes": 980000000.0,
"shared_bytes": 1600000000.0,
"shared_id": 4,
"task_footprint_bytes": 1024.0,
"task_serialized_bytes": 1024.0,
"task_working_bytes": 110000000.0
}
},
{
"entity": {
"collection_id": 7,
"home": 0,
"id": 262147,
"index": [
0,
0,
0
],
"migratable": true,
"type": "object"
},
"node": 0,
"resource": "cpu",
"time": 10.0,
"user_defined": {
"rank_working_bytes": 980000000.0,
"shared_bytes": 1600000000.0,
"shared_id": 0,
"task_footprint_bytes": 1024.0,
"task_serialized_bytes": 1024.0,
"task_working_bytes": 110000000.0
}
},
{
"entity": {
"collection_id": 7,
"home": 0,
"id": 786435,
"index": [
0,
2,
0
],
"migratable": true,
"type": "object"
},
"node": 0,
"resource": "cpu",
"time": 20.0,
"user_defined": {
"rank_working_bytes": 980000000.0,
"shared_bytes": 1600000000.0,
"shared_id": 8,
"task_footprint_bytes": 1024.0,
"task_serialized_bytes": 1024.0,
"task_working_bytes": 110000000.0
}
},
{
"entity": {
"collection_id": 7,
"home": 0,
"id": 1048579,
"index": [
0,
0,
1
],
"migratable": true,
"type": "object"
},
"node": 0,
"resource": "cpu",
"time": 15.0,
"user_defined": {
"rank_working_bytes": 980000000.0,
"shared_bytes": 1600000000.0,
"shared_id": 0,
"task_footprint_bytes": 1024.0,
"task_serialized_bytes": 1024.0,
"task_working_bytes": 110000000.0
}
},
{
"entity": {
"collection_id": 7,
"home": 0,
"id": 1310723,
"index": [
0,
1,
1
],
"migratable": true,
"type": "object"
},
"node": 0,
"resource": "cpu",
"time": 30.0,
"user_defined": {
"rank_working_bytes": 980000000.0,
"shared_bytes": 1600000000.0,
"shared_id": 4,
"task_footprint_bytes": 1024.0,
"task_serialized_bytes": 1024.0,
"task_working_bytes": 110000000.0
}
},
{
"entity": {
"collection_id": 7,
"home": 0,
"id": 1572867,
"index": [
0,
2,
1
],
"migratable": true,
"type": "object"
},
"node": 0,
"resource": "cpu",
"time": 5.0,
"user_defined": {
"rank_working_bytes": 980000000.0,
"shared_bytes": 1600000000.0,
"shared_id": 8,
"task_footprint_bytes": 1024.0,
"task_serialized_bytes": 1024.0,
"task_working_bytes": 110000000.0
}
}
]
}
]
}
1 change: 1 addition & 0 deletions tools/user-defined-memory-toy-problem/toy_mem.1.json
@@ -0,0 +1 @@
{"type":"LBDatafile","phases":[{"id":0,"tasks":[{"entity":{"collection_id":7,"home":1,"id":1310727,"index":[1,1,1],"migratable":true,"type":"object"},"node":1,"resource":"cpu","time":2.5,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":5,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":1,"id":1048583,"index":[1,0,1],"migratable":true,"type":"object"},"node":1,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":1,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":1,"id":786439,"index":[1,2,0],"migratable":true,"type":"object"},"node":1,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":9,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":1,"id":262151,"index":[1,0,0],"migratable":true,"type":"object"},"node":1,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":1,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":1,"id":524295,"index":[1,1,0],"migratable":true,"type":"object"},"node":1,"resource":"cpu","time":2.5,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":5,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}}]}]}
1 change: 1 addition & 0 deletions tools/user-defined-memory-toy-problem/toy_mem.2.json
@@ -0,0 +1 @@
{"type":"LBDatafile","phases":[{"id":0,"tasks":[{"entity":{"collection_id":7,"home":2,"id":1835019,"index":[2,0,2],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":15.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":2,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":2,"id":524299,"index":[2,1,0],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":10.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":6,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":2,"id":262155,"index":[2,0,0],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":2,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":2,"id":786443,"index":[2,2,0],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":2.5,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":10,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":2,"id":1048587,"index":[2,0,1],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":10.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":2,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":2,"id":1310731,"index":[2,1,1],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":6,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":2,"id":1572875,"index":[2,2,1],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":2.5,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":10,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}}]}]}
1 change: 1 addition & 0 deletions tools/user-defined-memory-toy-problem/toy_mem.3.json
@@ -0,0 +1 @@
{"type":"LBDatafile","phases":[{"id":0,"tasks":[{"entity":{"collection_id":7,"home":3,"id":2359311,"index":[3,2,2],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":10.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":11,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":2097167,"index":[3,1,2],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":15.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":7,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":1835023,"index":[3,0,2],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":3,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":524303,"index":[3,1,0],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":10.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":7,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":262159,"index":[3,0,0],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":10.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":3,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":786447,"index":[3,2,0],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":11,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":1048591,"index":[3,0,1],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":3,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":1310735,"index":[3,1,1],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":20.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":7,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":1572879,"index":[3,2,1],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":10.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":11,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}}]}]}
