Feature 280 lb sim output only #281
Conversation
/*static*/ void ProcStats::closeStatsFile() {
  if (stats_file_) {
    fclose(stats_file_);
I think you need to also assign nullptr here, or else that condition can trigger multiple times, even though the function superficially looks idempotent.
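A minimal sketch of the suggested fix, assuming stats_file_ is the static FILE* handle shown in the diff:

```cpp
/*static*/ void ProcStats::closeStatsFile() {
  if (stats_file_) {
    fclose(stats_file_);
    stats_file_ = nullptr;  // reset the handle so repeated calls remain a no-op
  }
}
```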
Problem addressed.
@@ -72,6 +73,10 @@ struct ProcStats {
  static void clearStats();
  static void releaseLB();

  static void createStatsFile();
  static void outputStatsFile();
  static void closeStatsFile();
If createStatsFile and closeStatsFile are going to be public, then outputStatsFile probably shouldn't be opening and (more importantly) closing it automatically. Someone might call open manually, and the output might close it on them.
Addressed by making {close,create}StatsFile private
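A sketch of how the declarations might end up after that change; the member names come from the diff above, while the FILE* handle name is an assumption:

```cpp
#include <cstdio>  // FILE

struct ProcStats {
  // ...existing public interface shown in the diff context...
  static void clearStats();
  static void releaseLB();

  // Public entry point: manages the file (open, write, close) internally.
  static void outputStatsFile();

private:
  // Private now, so outputStatsFile() can no longer close a file that a
  // caller opened manually.
  static void createStatsFile();
  static void closeStatsFile();

  static FILE* stats_file_;  // assumed backing handle
};
```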
// Barrier: wait for node 0 to create directory before trying to put a file in
// the stats destination directory
if (curRT) {
  curRT->systemSync();
It looks like that calls sync(), but doesn't also call a barrier?
sync() is a barrier under the hood.
I see - I read that as the POSIX sync(), not a call to the sibling member function.
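To spell out the ordering being discussed, here is a self-contained sketch; in the PR the barrier is curRT->systemSync(), which is passed in here as a generic callable, and the mkdir call and function name are illustrative assumptions:

```cpp
#include <sys/stat.h>   // mkdir (POSIX)
#include <functional>
#include <string>

// Rank 0 creates the stats destination directory, then every rank waits at a
// barrier before creating its own file inside that directory.
void setupStatsDir(
  std::string const& stats_dir, int node, std::function<void()> barrier
) {
  if (node == 0) {
    mkdir(stats_dir.c_str(), 0755);  // only node 0 creates the directory
  }
  barrier();  // e.g. curRT->systemSync(): a barrier under the hood
  // After the barrier it is safe for all nodes to open files in stats_dir.
}
```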
2 questions here to make sure I fully understand:

- Each of these files provides a list of task objects and assorted timings, varying over time. For example,

  0,68719476736,0.023175
  ...
  1,68719476736,0.0237269

  in the first output file means that the same object (with unique ID 68719476736) resides on the same processor during steps 0 and 1, and that the compute time of that object varied between those steps.
- The LB operations would happen between the two steps. Therefore, if this object (with unique ID 68719476736) had been offloaded from, say, processor 0 to processor 2 during the LB stage, its time information for the subsequent step,

  1,68719476736,0.0237269

  would no longer be found in the same file output-0.csv, but in output-2.csv.

Is this a correct understanding?
Philippe, you are exactly correct on both counts. The benchmark I'm running does approximately the same work for each iteration (it's just a micro-benchmark for LB), but it varies a little. I'm always dumping the objects that are local to each processor, and the object ID is unique for the duration of the program, so an object can be cross-identified regardless of which processor it is mapped to.
@ppebay Do you have a suggestion for the format when I add the communication edges to the output? Or do you want to convert this to an Exodus format before I do that?
Thanks for confirming; this information layout is perfect for the "load-only" algorithm.

Regarding edges, here is how I understand things: the primary quantities in VT are object-to-object links. However, the ones that we would take into account for LB purposes would be processor-to-processor links (it is presumably reasonable to assume that if object A talks to object B on the same processor, this has 0 communication cost -- at least as a first-order approximation), whereas if the former resides on processor M and the latter on processor N, then link A->B contributes to link M->N, which is the one that we want to consider. Other objects communicating similarly would also contribute to link M->N, which can thus be viewed as an aggregate -- in other words, a derived quantity.

The question thus naturally becomes: do we want to (1) store the object-to-object links (in which case these will be aggregated into processor-to-processor links at ingestion time when the LB phase starts), or, alternatively, (2) compute those statistics inside VT and dump the already computed aggregates? In terms of cost, although tallying the costs before writing does not cost more than doing it after reading, approach (1) would result in much larger output files because many more links would be stored, which would in turn result in longer write/read times. So the only advantage that I can see in approach (1) with respect to (2) would be to preserve the primary information (the object-to-object traffic) as opposed to the derived aggregates, but I fail to see how that could be useful for the methodology we are considering. I am therefore suggesting that VT tally the object-to-object communication per processor and store that aggregate information after the object lines, as follows:
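For concreteness, here is a sketch (hypothetical names and types) of the option-(1) aggregation at ingestion time, contracting object-to-object edges into processor-to-processor edges:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct ObjectEdge {
  std::uint64_t from_obj, to_obj;  // distributed-unique object IDs
  double bytes;                    // edge weight
};

using Rank = int;

// Contract the object graph onto the current placement: every A->B edge
// contributes its weight to the M->N link of the ranks hosting A and B.
std::map<std::pair<Rank, Rank>, double> aggregateToRankEdges(
  std::vector<ObjectEdge> const& edges,
  std::map<std::uint64_t, Rank> const& obj_to_rank
) {
  std::map<std::pair<Rank, Rank>, double> rank_edges;
  for (auto const& e : edges) {
    Rank const m = obj_to_rank.at(e.from_obj);
    Rank const n = obj_to_rank.at(e.to_obj);
    if (m != n) {                     // same-rank traffic treated as free
      rank_edges[{m, n}] += e.bytes;  // aggregate into the derived quantity
    }
  }
  return rank_edges;
}
```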
In modeling or simulating how each object's communication contributes to the load it represents, I do think we need to record the object-level communication graph, not the contracted process-level graph that arises from it. When we evaluate the hypothetical re-assignment of an object to a different processor, we need to consider the change in overall communication resulting from moving that object. So, we would at least need to know what processes each object's communication involved. When we evaluate re-assignment of multiple objects, then the information of what processes a given object was communicating with is no longer informative, because some of the communication partners may also have moved. So, I conclude that we need full object-level communication records to be available to the load balancing strategies, and any simulations thereof.
@PhilMiller Agreed.
Ah, good to hear. Indeed, for most reasonable applications, each object is only expected to interact with a small, often constant, number of other objects, so there would be only a small multiplicative-factor increase in the footprint of the object graph data.
So, continuing with the object-to-object communication: what does the weight (scalar) associated with an edge represent? Is it a flow through the link? Or rather the amount of data that must be transferred between the two end-points?
The weight (scalar) for object-to-object communication can be any normalized scalar, but I think that bytes makes the most sense because it's easily compatible with information we obtain about the network (if/when we add topology as a factor). What do you think of the following format?
Each line would have the object, the time, and then a series of objects it communicates with, which may be empty; each line would be variable in length. Or we could put this on separate lines, so there would be two different formats: if there are 4 elements on a line it would imply communication, otherwise it would be a computation line.
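To make the two variants concrete, here is one possible reading with invented values; the exact field order on communication lines (and whether bytes appear per peer) is an assumption, not something fixed yet in this thread:

```
# Variant 1: one variable-length line per object
# <phase>,<object-id>,<time>[,<peer-object-id>,<bytes>,...]
0,68719476736,0.023175,68719476737,4096

# Variant 2: separate lines; 3 elements = computation, 4 = communication
0,68719476736,0.023175
0,68719476736,68719476737,4096
```

In the second variant a reader only needs to count the fields on each line to tell the two record types apart.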
I have a preference for the latter because it is closer to what we could easily map into an ExodusII format via the VTK I/O facilities; in our case, 3 elements on a line imply a node (vertex), and 4 an edge (line). In fact, with just a few extra boilerplate lines in the file, it could be used as-is in the old-style ASCII VTK readers. So, I am voting for the latter. TY.
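As an illustration of the "few extra boilerplate lines" idea, this is roughly what a minimal old-style ASCII VTK file looks like with two vertices, one edge between them, and per-vertex load data (all values invented; the actual mapping from the CSV is not settled in this thread):

```
# vtk DataFile Version 2.0
hypothetical object graph for one phase
ASCII
DATASET UNSTRUCTURED_GRID
POINTS 2 float
0 0 0
1 0 0
CELLS 1 3
2 0 1
CELL_TYPES 1
3
POINT_DATA 2
SCALARS load float 1
LOOKUP_TABLE default
0.023175
0.0237269
```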
Force-pushed from d2004a4 to ea1e42a.
This pull request adds functionality to VT that enables it to dump load information to file. This includes only the computation stats so far (not the communication graph).
It adds 3 new optional command-line arguments to enable it (opt-in):
mpirun -n 4 ./examples/lb_iter --vt_lb_stats --vt_lb_stats_dir=my-stats-dir --vt_lb_stats_file=stats
This creates the my-stats-dir directory containing one stats file per node.
Each CSV file follows this format:
<iter/phase>, <distributed-unique-object-id>, <time-in-seconds>
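For instance, a minimal reader for one of these per-node files might look like this; the file name is hypothetical, and the format string simply matches the comma-separated lines shown in the discussion above:

```cpp
#include <cstdio>

int main() {
  // Hypothetical per-node stats file produced with --vt_lb_stats_file=stats
  FILE* f = std::fopen("my-stats-dir/stats.0.out", "r");
  if (!f) return 1;
  int phase = 0;
  unsigned long long obj_id = 0;
  double time_s = 0.0;
  // Each record: <iter/phase>,<distributed-unique-object-id>,<time-in-seconds>
  while (std::fscanf(f, "%d,%llu,%lf", &phase, &obj_id, &time_s) == 3) {
    std::printf("phase=%d object=%llu time=%fs\n", phase, obj_id, time_s);
  }
  std::fclose(f);
  return 0;
}
```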