Commit
#2201: tools: NOT to merge: for now, add the user-defined problem to …
…repo to make it easy to run
1 parent
0ae6532
commit ea410f5
Showing 5 changed files with 389 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,101 @@ | ||
These files describe a toy problem for testing whether a memory-aware load | ||
balancer is achieving a sensible solution. | ||
|
||
The 3D vt index in these files is: | ||
(rank_index, decomp_index_on_rank, task_index_on_decomp). | ||
|
||
Each task appears in the JSON files on its home rank (rank_index) where | ||
communication costs will be zero, so no communication edges were included. | ||
However, see the final paragraph for details about communication patterns that | ||
will emerge when the tasks are migrated off the home rank. | ||
|
||
The "user_defined" section of the JSON data contains the following fields: | ||
- "task_serialized_bytes": This is the serialized size of the task, which can be | ||
used for modeling the migration cost of the task. It should not be included | ||
when computing the memory usage on a rank. | ||
- "shared_id": This uniquely identifies a block of data on which multiple tasks | ||
will operate. While not important, the shared_id was computed using: | ||
shared_id = decomp_index_on_rank * num_ranks + rank_index | ||
- "shared_bytes": This is the size of the block of data being operated on by the | ||
relevant set of tasks. This memory cost will be incurred exactly once on each | ||
MPI rank on which a task with this shared_id exists. | ||
- "task_footprint_bytes": This is the footprinted size of the task in its | ||
non-running state. We will incur this memory cost once for each individual | ||
task, even if there are other tasks on this rank with the same shared_id. This | ||
can be greater than task_serialized_bytes when the task has data members that | ||
have greater capacity than is being used at serialization time. | ||
- "task_working_bytes": This is the high water mark of the additional working | ||
memory required by the individual task, such as temporary memory needed for | ||
intermediate computation. This value does not include memory shared with other | ||
tasks (i.e., shared_bytes), nor does it include the task_footprint_bytes or | ||
task_serialized_bytes. This cost is incurred for each individual task, but | ||
only one at a time because tasks will not run concurrently. | ||
- "rank_working_bytes": This is the amount of memory that the particular rank | ||
needs while processing tasks. This may include global data, constants, and | ||
completely unrelated data pre-allocated by the application. It is assumed to | ||
be constant over time but may vary from rank to rank. This value does not | ||
include shared_bytes, task_working_bytes, task_footprint_bytes, or | ||
task_serialized_bytes. | ||
|
||
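The shared_id formula in the field list can be checked against the index of any task in the data files. Below is a minimal sketch, assuming four ranks (the count given in the spoilers section); the helper name shared_id_for is made up for illustration and is not part of vt.

```python
def shared_id_for(index, num_ranks):
    """Return the shared_id for a 3D vt index (hypothetical helper)."""
    rank_index, decomp_index_on_rank, _task_index_on_decomp = index
    return decomp_index_on_rank * num_ranks + rank_index


# Indices and expected shared_id values copied from the JSON data files.
print(shared_id_for([0, 1, 4], num_ranks=4))  # 4
print(shared_id_for([3, 2, 2], num_ranks=4))  # 11
```

Because rank_index is always smaller than num_ranks, the home rank can be recovered from a shared_id as shared_id % num_ranks.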
The maximum memory usage for determining if task placement is feasible will be: | ||
max_memory_usage = rank_working_bytes + shared_level_memory + max_task_level_memory | ||
|
||
Computing shared_level_memory: Let S be the set of unique shared_id values on | ||
the rank being considered. Then shared_level_memory is simply the sum of | ||
shared_bytes values for each shared_id in S. | ||
|
||
Computing max_task_level_memory: Let T be the set of all tasks on a rank, | ||
regardless of the shared_id on which they operate. Then max_task_level_memory | ||
is the sum of task_footprint_bytes values for each task in T plus the maximum | ||
over the task_working_bytes values for each task in T. | ||
|
||
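The three quantities above can be combined as in the following sketch. It assumes the caller passes the "user_defined" dictionaries of the tasks placed on one rank, and passes rank_working_bytes separately because that value belongs to the rank rather than to any migrated task. The function name and the task() helper are illustrative only.

```python
def max_memory_usage(rank_working_bytes, tasks):
    """tasks: list of "user_defined" dicts for the tasks placed on one rank."""
    # Each shared block is counted once per rank, however many tasks use it.
    shared_bytes_by_id = {t["shared_id"]: t["shared_bytes"] for t in tasks}
    shared_level_memory = sum(shared_bytes_by_id.values())
    # Footprints add up; working memory is needed by one task at a time.
    footprint_total = sum(t["task_footprint_bytes"] for t in tasks)
    max_working = max((t["task_working_bytes"] for t in tasks), default=0.0)
    max_task_level_memory = footprint_total + max_working
    return rank_working_bytes + shared_level_memory + max_task_level_memory


def task(shared_id):
    """Build a "user_defined" dict using the byte sizes from the data files."""
    return {"shared_id": shared_id, "shared_bytes": 1600000000.0,
            "task_footprint_bytes": 1024.0, "task_working_bytes": 110000000.0}


# The five tasks whose home is rank 1, before any migration.
rank1_home = [task(5), task(1), task(9), task(1), task(5)]
print(max_memory_usage(980000000.0, rank1_home))  # 5890005120.0
```

Here three unique shared_id values contribute 3 * 1.6e9 bytes, the five footprints contribute 5 * 1024 bytes, and task_working_bytes is counted only once.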
Any communication-aware load balancer should also consider the communication | ||
implied by this memory data. The task_serialized_bytes is the serialized size | ||
of the task, so migrating it will require a communication of at least that size | ||
from the home rank to the target rank. For applications where the shared memory | ||
corresponding to a shared_id is writable, at least shared_bytes per unique shared_id | ||
on a target rank will need to be communicated from the target rank back to the | ||
home rank after the relevant tasks complete. | ||
|
||
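As a rough sketch of the communication implied above, the following computes lower bounds on the bytes moved for a given placement. It assumes every shared block is writable, and it represents a placement as (home_rank, target_rank, user_defined) tuples; that layout and the function name are assumptions made for this example.

```python
def communication_bytes(placed):
    """placed: list of (home_rank, target_rank, user_defined dict) tuples.

    Returns (migration_bytes, writeback_bytes), both lower bounds.
    """
    migration = 0.0
    writeback_blocks = {}  # (target_rank, shared_id) -> shared_bytes
    for home, target, ud in placed:
        if target == home:
            continue  # tasks left on their home rank cost nothing
        migration += ud["task_serialized_bytes"]
        # One write-back per unique shared_id on each non-home rank.
        writeback_blocks[(target, ud["shared_id"])] = ud["shared_bytes"]
    return migration, sum(writeback_blocks.values())


def ud(shared_id):
    """Build a "user_defined" dict using the byte sizes from the data files."""
    return {"shared_id": shared_id, "shared_bytes": 1600000000.0,
            "task_serialized_bytes": 1024.0}


# Two tasks of shared_id 1 (home 1) and two of shared_id 10 (home 2) moved to
# rank 0, plus one task of shared_id 4 that stays on its home rank 0.
placed = [(1, 0, ud(1)), (1, 0, ud(1)), (2, 0, ud(10)), (2, 0, ud(10)),
          (0, 0, ud(4))]
print(communication_bytes(placed))  # (4096.0, 3200000000.0)
```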
***Spoilers*** | ||
|
||
Each of four ranks has three shared blocks. The memory constraints dictate that | ||
at most four unique shared_id values can coexist on each rank. Under these | ||
memory constraints, it is possible to perfectly balance the load (time). There | ||
is more than one way to do so. The communication cost to migrate a task off-rank | ||
is extremely low, but the cost of communicating the result back should be | ||
significant enough to discourage migrating shared_ids to other ranks unless | ||
doing so results in a better-balanced load. | ||
|
||
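The limit on unique shared_id values per rank follows from the memory formula once a per-rank memory cap is fixed. These files do not state that cap, so the 8e9-byte value below is purely an assumption chosen for illustration; the helper name is also hypothetical. The sketch ignores task_footprint_bytes, which at 1024 bytes per task is negligible next to the other terms.

```python
def max_shared_blocks(memory_cap_bytes, rank_working_bytes, shared_bytes,
                      task_working_bytes):
    """Largest number of shared blocks that fits under the cap (footprints ignored)."""
    available = memory_cap_bytes - rank_working_bytes - task_working_bytes
    return max(0, int(available // shared_bytes))


ASSUMED_CAP = 8.0e9  # hypothetical cap, NOT taken from the data files
print(max_shared_blocks(ASSUMED_CAP, 980000000.0, 1600000000.0, 110000000.0))  # 4
```

With the byte sizes in the data files, four blocks need about 7.49e9 bytes and five need about 9.09e9 bytes, so any cap between those two values gives the same answer.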
One of the ranks has exactly the rank-averaged load, so it is best if the tasks | ||
on that rank are left in place. Another rank has more than twice the | ||
rank-averaged load. The sum of the loads for the tasks corresponding to one of | ||
its shared_id values is more than the rank-averaged load, so the tasks for that | ||
shared_id will need to be split across two ranks to achieve good balance. The | ||
tasks for the other shared_ids across all ranks do not need to be split across | ||
multiple ranks to perfectly balance the load (time). | ||
|
||
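To see which shared block dominates a rank, the task times can be summed per shared_id. The sketch below does this for the tasks in the data file whose home is rank 0, with the (shared_id, time) pairs copied from that file in order; the function name is illustrative.

```python
from collections import defaultdict


def load_per_shared_id(tasks):
    """tasks: iterable of (shared_id, time) pairs; returns total time per shared_id."""
    totals = defaultdict(float)
    for shared_id, time in tasks:
        totals[shared_id] += time
    return dict(totals)


# (shared_id, time) for the eleven tasks whose home is rank 0.
rank0 = [(4, 10.0), (4, 35.0), (8, 10.0), (4, 25.0), (0, 10.0), (4, 20.0),
         (0, 10.0), (8, 20.0), (0, 15.0), (4, 30.0), (8, 5.0)]
print(load_per_shared_id(rank0))  # {4: 120.0, 8: 35.0, 0: 35.0}
```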
Below is one solution with a perfectly balanced load and decent communication. | ||
I have not evaluated whether it is optimal. | ||
|
||
Rank 0: | ||
[0,1,1],[0,1,3],[0,1,4] (part of block home) | ||
[1,0,0],[1,0,1] (whole block not home) | ||
[2,2,0],[2,2,1] (whole block not home) | ||
|
||
Rank 1: | ||
[1,1,0],[1,1,1] (home) | ||
[1,2,0] (home) | ||
[0,0,0],[0,0,1],[0,0,2] (whole block not home) | ||
[0,1,0],[0,1,2] (part of block not home) | ||
|
||
Rank 2: | ||
[2,0,0],[2,0,1],[2,0,2] (home) | ||
[2,1,0],[2,1,1] (home) | ||
[0,2,0],[0,2,1],[0,2,2] (whole block not home) | ||
|
||
Rank 3: | ||
[3,0,0],[3,0,1],[3,0,2] (home) | ||
[3,1,0],[3,1,1],[3,1,2] (home) | ||
[3,2,0],[3,2,1],[3,2,2] (home) | ||
|
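The placement listed above can be checked against the limit of four unique shared_id values per rank. This is a small sketch that reuses the shared_id formula from the field descriptions with num_ranks = 4; the dictionary layout and function name are choices made for this example only.

```python
NUM_RANKS = 4

# Task indices per rank, copied from the solution listed above.
solution = {
    0: [[0, 1, 1], [0, 1, 3], [0, 1, 4], [1, 0, 0], [1, 0, 1],
        [2, 2, 0], [2, 2, 1]],
    1: [[1, 1, 0], [1, 1, 1], [1, 2, 0], [0, 0, 0], [0, 0, 1], [0, 0, 2],
        [0, 1, 0], [0, 1, 2]],
    2: [[2, 0, 0], [2, 0, 1], [2, 0, 2], [2, 1, 0], [2, 1, 1],
        [0, 2, 0], [0, 2, 1], [0, 2, 2]],
    3: [[3, 0, 0], [3, 0, 1], [3, 0, 2], [3, 1, 0], [3, 1, 1], [3, 1, 2],
        [3, 2, 0], [3, 2, 1], [3, 2, 2]],
}


def unique_shared_ids(indices, num_ranks=NUM_RANKS):
    """Sorted unique shared_id values for a list of 3D task indices."""
    return sorted({decomp * num_ranks + rank for rank, decomp, _ in indices})


for rank, indices in solution.items():
    print(rank, unique_shared_ids(indices))
# 0 [1, 4, 10]
# 1 [0, 4, 5, 9]
# 2 [2, 6, 8]
# 3 [3, 7, 11]
```

Only rank 1 reaches the limit of four, and shared_id 4 is the one block that appears on two ranks.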
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,285 @@ | ||
{ | ||
"type": "LBDatafile", | ||
"phases": [ | ||
{ | ||
"id": 0, | ||
"tasks": [ | ||
{ | ||
"entity": { | ||
"collection_id": 7, | ||
"home": 0, | ||
"id": 2883587, | ||
"index": [ | ||
0, | ||
1, | ||
4 | ||
], | ||
"migratable": true, | ||
"type": "object" | ||
}, | ||
"node": 0, | ||
"resource": "cpu", | ||
"time": 10.0, | ||
"user_defined": { | ||
"rank_working_bytes": 980000000.0, | ||
"shared_bytes": 1600000000.0, | ||
"shared_id": 4, | ||
"task_footprint_bytes": 1024.0, | ||
"task_serialized_bytes": 1024.0, | ||
"task_working_bytes": 110000000.0 | ||
} | ||
}, | ||
{ | ||
"entity": { | ||
"collection_id": 7, | ||
"home": 0, | ||
"id": 2621443, | ||
"index": [ | ||
0, | ||
1, | ||
3 | ||
], | ||
"migratable": true, | ||
"type": "object" | ||
}, | ||
"node": 0, | ||
"resource": "cpu", | ||
"time": 35.0, | ||
"user_defined": { | ||
"rank_working_bytes": 980000000.0, | ||
"shared_bytes": 1600000000.0, | ||
"shared_id": 4, | ||
"task_footprint_bytes": 1024.0, | ||
"task_serialized_bytes": 1024.0, | ||
"task_working_bytes": 110000000.0 | ||
} | ||
}, | ||
{ | ||
"entity": { | ||
"collection_id": 7, | ||
"home": 0, | ||
"id": 2359299, | ||
"index": [ | ||
0, | ||
2, | ||
2 | ||
], | ||
"migratable": true, | ||
"type": "object" | ||
}, | ||
"node": 0, | ||
"resource": "cpu", | ||
"time": 10.0, | ||
"user_defined": { | ||
"rank_working_bytes": 980000000.0, | ||
"shared_bytes": 1600000000.0, | ||
"shared_id": 8, | ||
"task_footprint_bytes": 1024.0, | ||
"task_serialized_bytes": 1024.0, | ||
"task_working_bytes": 110000000.0 | ||
} | ||
}, | ||
{ | ||
"entity": { | ||
"collection_id": 7, | ||
"home": 0, | ||
"id": 2097155, | ||
"index": [ | ||
0, | ||
1, | ||
2 | ||
], | ||
"migratable": true, | ||
"type": "object" | ||
}, | ||
"node": 0, | ||
"resource": "cpu", | ||
"time": 25.0, | ||
"user_defined": { | ||
"rank_working_bytes": 980000000.0, | ||
"shared_bytes": 1600000000.0, | ||
"shared_id": 4, | ||
"task_footprint_bytes": 1024.0, | ||
"task_serialized_bytes": 1024.0, | ||
"task_working_bytes": 110000000.0 | ||
} | ||
}, | ||
{ | ||
"entity": { | ||
"collection_id": 7, | ||
"home": 0, | ||
"id": 1835011, | ||
"index": [ | ||
0, | ||
0, | ||
2 | ||
], | ||
"migratable": true, | ||
"type": "object" | ||
}, | ||
"node": 0, | ||
"resource": "cpu", | ||
"time": 10.0, | ||
"user_defined": { | ||
"rank_working_bytes": 980000000.0, | ||
"shared_bytes": 1600000000.0, | ||
"shared_id": 0, | ||
"task_footprint_bytes": 1024.0, | ||
"task_serialized_bytes": 1024.0, | ||
"task_working_bytes": 110000000.0 | ||
} | ||
}, | ||
{ | ||
"entity": { | ||
"collection_id": 7, | ||
"home": 0, | ||
"id": 524291, | ||
"index": [ | ||
0, | ||
1, | ||
0 | ||
], | ||
"migratable": true, | ||
"type": "object" | ||
}, | ||
"node": 0, | ||
"resource": "cpu", | ||
"time": 20.0, | ||
"user_defined": { | ||
"rank_working_bytes": 980000000.0, | ||
"shared_bytes": 1600000000.0, | ||
"shared_id": 4, | ||
"task_footprint_bytes": 1024.0, | ||
"task_serialized_bytes": 1024.0, | ||
"task_working_bytes": 110000000.0 | ||
} | ||
}, | ||
{ | ||
"entity": { | ||
"collection_id": 7, | ||
"home": 0, | ||
"id": 262147, | ||
"index": [ | ||
0, | ||
0, | ||
0 | ||
], | ||
"migratable": true, | ||
"type": "object" | ||
}, | ||
"node": 0, | ||
"resource": "cpu", | ||
"time": 10.0, | ||
"user_defined": { | ||
"rank_working_bytes": 980000000.0, | ||
"shared_bytes": 1600000000.0, | ||
"shared_id": 0, | ||
"task_footprint_bytes": 1024.0, | ||
"task_serialized_bytes": 1024.0, | ||
"task_working_bytes": 110000000.0 | ||
} | ||
}, | ||
{ | ||
"entity": { | ||
"collection_id": 7, | ||
"home": 0, | ||
"id": 786435, | ||
"index": [ | ||
0, | ||
2, | ||
0 | ||
], | ||
"migratable": true, | ||
"type": "object" | ||
}, | ||
"node": 0, | ||
"resource": "cpu", | ||
"time": 20.0, | ||
"user_defined": { | ||
"rank_working_bytes": 980000000.0, | ||
"shared_bytes": 1600000000.0, | ||
"shared_id": 8, | ||
"task_footprint_bytes": 1024.0, | ||
"task_serialized_bytes": 1024.0, | ||
"task_working_bytes": 110000000.0 | ||
} | ||
}, | ||
{ | ||
"entity": { | ||
"collection_id": 7, | ||
"home": 0, | ||
"id": 1048579, | ||
"index": [ | ||
0, | ||
0, | ||
1 | ||
], | ||
"migratable": true, | ||
"type": "object" | ||
}, | ||
"node": 0, | ||
"resource": "cpu", | ||
"time": 15.0, | ||
"user_defined": { | ||
"rank_working_bytes": 980000000.0, | ||
"shared_bytes": 1600000000.0, | ||
"shared_id": 0, | ||
"task_footprint_bytes": 1024.0, | ||
"task_serialized_bytes": 1024.0, | ||
"task_working_bytes": 110000000.0 | ||
} | ||
}, | ||
{ | ||
"entity": { | ||
"collection_id": 7, | ||
"home": 0, | ||
"id": 1310723, | ||
"index": [ | ||
0, | ||
1, | ||
1 | ||
], | ||
"migratable": true, | ||
"type": "object" | ||
}, | ||
"node": 0, | ||
"resource": "cpu", | ||
"time": 30.0, | ||
"user_defined": { | ||
"rank_working_bytes": 980000000.0, | ||
"shared_bytes": 1600000000.0, | ||
"shared_id": 4, | ||
"task_footprint_bytes": 1024.0, | ||
"task_serialized_bytes": 1024.0, | ||
"task_working_bytes": 110000000.0 | ||
} | ||
}, | ||
{ | ||
"entity": { | ||
"collection_id": 7, | ||
"home": 0, | ||
"id": 1572867, | ||
"index": [ | ||
0, | ||
2, | ||
1 | ||
], | ||
"migratable": true, | ||
"type": "object" | ||
}, | ||
"node": 0, | ||
"resource": "cpu", | ||
"time": 5.0, | ||
"user_defined": { | ||
"rank_working_bytes": 980000000.0, | ||
"shared_bytes": 1600000000.0, | ||
"shared_id": 8, | ||
"task_footprint_bytes": 1024.0, | ||
"task_serialized_bytes": 1024.0, | ||
"task_working_bytes": 110000000.0 | ||
} | ||
} | ||
] | ||
} | ||
] | ||
} |
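A file with this layout can be read with the standard json module. The sketch below walks phases and tasks and yields the fields used by the memory model; iter_tasks is a hypothetical helper, and the sample string is a single task copied from the file above rather than a full data file.

```python
import json


def iter_tasks(lb_data):
    """Yield (index, time, user_defined) for every task in an LBDatafile dict."""
    for phase in lb_data["phases"]:
        for task in phase["tasks"]:
            yield tuple(task["entity"]["index"]), task["time"], task["user_defined"]


sample = json.loads("""
{"type": "LBDatafile", "phases": [{"id": 0, "tasks": [
  {"entity": {"collection_id": 7, "home": 0, "id": 2883587,
              "index": [0, 1, 4], "migratable": true, "type": "object"},
   "node": 0, "resource": "cpu", "time": 10.0,
   "user_defined": {"rank_working_bytes": 980000000.0,
                    "shared_bytes": 1600000000.0, "shared_id": 4,
                    "task_footprint_bytes": 1024.0,
                    "task_serialized_bytes": 1024.0,
                    "task_working_bytes": 110000000.0}}]}]}
""")

for index, time, user_defined in iter_tasks(sample):
    print(index, time, user_defined["shared_id"])  # (0, 1, 4) 10.0 4
```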
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"type":"LBDatafile","phases":[{"id":0,"tasks":[{"entity":{"collection_id":7,"home":1,"id":1310727,"index":[1,1,1],"migratable":true,"type":"object"},"node":1,"resource":"cpu","time":2.5,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":5,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":1,"id":1048583,"index":[1,0,1],"migratable":true,"type":"object"},"node":1,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":1,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":1,"id":786439,"index":[1,2,0],"migratable":true,"type":"object"},"node":1,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":9,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":1,"id":262151,"index":[1,0,0],"migratable":true,"type":"object"},"node":1,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":1,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":1,"id":524295,"index":[1,1,0],"migratable":true,"type":"object"},"node":1,"resource":"cpu","time":2.5,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":5,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}}]}]} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"type":"LBDatafile","phases":[{"id":0,"tasks":[{"entity":{"collection_id":7,"home":2,"id":1835019,"index":[2,0,2],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":15.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":2,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":2,"id":524299,"index":[2,1,0],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":10.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":6,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":2,"id":262155,"index":[2,0,0],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":2,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":2,"id":786443,"index":[2,2,0],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":2.5,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":10,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":2,"id":1048587,"index":[2,0,1],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":10.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":2,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":2,"id":1310731,"index":[2,1,1],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":6,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":2,"id":1572875,"index":[2,2,1],"migratable":true,"type":"object"},"node":2,"resource":"cpu","time":2.5,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":10,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}}]}]} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"type":"LBDatafile","phases":[{"id":0,"tasks":[{"entity":{"collection_id":7,"home":3,"id":2359311,"index":[3,2,2],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":10.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":11,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":2097167,"index":[3,1,2],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":15.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":7,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":1835023,"index":[3,0,2],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":3,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":524303,"index":[3,1,0],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":10.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":7,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":262159,"index":[3,0,0],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":10.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":3,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":786447,"index":[3,2,0],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":11,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":1048591,"index":[3,0,1],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":5.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":3,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":1310735,"index":[3,1,1],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":20.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":7,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}},{"entity":{"collection_id":7,"home":3,"id":1572879,"index":[3,2,1],"migratable":true,"type":"object"},"node":3,"resource":"cpu","time":10.0,"user_defined":{"rank_working_bytes":980000000.0,"shared_bytes":1600000000.0,"shared_id":11,"task_footprint_bytes":1024.0,"task_serialized_bytes":1024.0,"task_working_bytes":110000000.0}}]}]} |