
Feature 280 lb sim output only #281

Merged — 13 commits, merged Feb 27, 2019

Conversation
Conversation

lifflander (Collaborator):

This pull request adds functionality to VT that enables it to dump load information to file. This includes only the computation stats so far (not the communication graph).

It adds three new optional arguments that enable this behavior (opt-in):
mpirun -n 4 ./examples/lb_iter --vt_lb_stats --vt_lb_stats_dir=my-stats-dir --vt_lb_stats_file=stats

This creates the tree:

<current-dir>
    my-stats-dir
        stats.0.out
        stats.1.out
        stats.2.out
        stats.3.out

Each file is in CSV and follows this format:
<iter/phase>, <distributed-unique-object-id>, <time-in-seconds>

0,68719476736,0.023175
0,64424509440,0.0229769
0,60129542144,0.0236919
0,55834574848,0.022301
0,51539607552,0.0231979
0,47244640256,0.0263572
0,42949672960,0.024205
0,38654705664,0.029372
0,34359738368,3.79086e-05
0,30064771072,3.21865e-05
0,25769803776,3.19481e-05
0,21474836480,3.60012e-05
0,17179869184,3.19481e-05
0,12884901888,3.31402e-05
0,8589934592,3.69549e-05
0,4294967296,6.10352e-05
1,68719476736,0.0237269
1,64424509440,0.022706
1,60129542144,0.0231159
1,55834574848,0.023596
1,51539607552,0.0259681
1,47244640256,0.0247791
1,42949672960,0.0229402
1,38654705664,0.0230789
1,34359738368,3.79086e-05
1,30064771072,2.5034e-05
1,25769803776,2.47955e-05
1,21474836480,2.38419e-05
1,17179869184,4.60148e-05
1,12884901888,2.40803e-05
1,8589934592,4.1008e-05
1,4294967296,2.71797e-05


/*static*/ void ProcStats::closeStatsFile() {
  if (stats_file_) {
    fclose(stats_file_);

Member:

I think you need to also assign nullptr here, or else that condition can trigger multiple times, even though the function superficially looks idempotent

lifflander (Collaborator, Author):

Addressed problem

@@ -72,6 +73,10 @@ struct ProcStats {
  static void clearStats();
  static void releaseLB();

  static void createStatsFile();
  static void outputStatsFile();
  static void closeStatsFile();

Member:

If createStatsFile and closeStatsFile are going to be public, then outputStatsFile probably shouldn't be opening and (more importantly) closing it automatically. Someone might call open manually, and the output might close it on them.

lifflander (Collaborator, Author):

Addressed by making {close,create}StatsFile private

// Barrier: wait for node 0 to create directory before trying to put a file in
// the stats destination directory
if (curRT) {
  curRT->systemSync();

Member:

It looks like that calls sync(), but doesn't also call a barrier?

lifflander (Collaborator, Author):

sync() is a barrier under the hood.

Member:

I see. I read that as the POSIX sync(), not a call to the sibling member function.

@ppebay (Contributor) left a review comment:

Two questions to make sure I fully understand:

  1. Each of these files provides a list of task objects and associated timings, varying over time. For example:
    0,68719476736,0.023175
    ...
    1,68719476736,0.0237269
    in the first output file means that the same object (with unique ID 68719476736) resides on the same processor during steps 0 and 1, and that the compute time of that object varied between those steps.
  2. The LB operations would happen between the two steps. Therefore, if this object (with unique ID 68719476736) had been offloaded from, say, processor 0 to processor 2 during the LB stage, its time information for step 1:
    1,68719476736,0.0237269
    would no longer be found in the same output file stats.0.out, but in stats.2.out.

Is this a correct understanding?

@lifflander (Collaborator, Author):

Philippe,

You are exactly correct on both counts. The benchmark I'm running does approximately the same work in each iteration (it is just a micro-benchmark for LB), but it varies a little. I always dump the objects that are local to each processor, and the object ID is unique for the duration of the program, so an object can be cross-identified regardless of which processor it is mapped to.

@lifflander (Collaborator, Author):

@ppebay Do you have a suggestion for the format when I add the communication edges to the output? Or do you want to convert this to an Exodus format before I do that?

@ppebay (Contributor) commented Feb 23, 2019:

Thanks for confirming, this information layout is perfect for the "load-only" algorithm.

Regarding edges, here is how I understand things: the primary quantities in VT are object-to-object links. However, the ones that we would take into account for LB purposes would be processor-to-processor links (it is presumably reasonable to assume that if object A talks to object B on the same processor, this has zero communication cost, at least as a first-order approximation), whereas if A resides on processor M and B on processor N, then link A->B contributes to link M->N, which is the one that we want to consider. Other objects communicating similarly would also contribute to link M->N, which can thus be viewed as an aggregate; in other words, a derived quantity.

So, the question thus naturally becomes: do we want to (1) store the object to object links (in which case these will be aggregated into processor to processor links at ingestion time when the LB phase starts), or, alternatively, (2) compute those statistics inside VT and dump the already computed aggregates?

In terms of cost, although tallying the costs up front is no more expensive than aggregating them at read time, approach (1) would result in much larger output files, as many more links would be stored. That would in turn result in longer write/read times. So the only advantage that I can see of approach (1) over (2) is that it preserves the primary information (the object-to-object traffic) rather than derived aggregates, but I fail to see how that could be useful for the methodology we are considering.

So, I am therefore suggesting that VT tallies the object-to-object communication, per processor, and stores that aggregate information after the object lines, as follows:
<iter/phase>, <destination-processor>, <num-bytes-sent>
We could also store the incoming volume, per processor, but those values should all be implicit. For instance, if there were only processors 0 and 1 participating in the run at iteration 0, with 0 sending 1B to 1 and receiving 2B from it during that iteration, we would have the following line at the end of stats.0.out:
0,1,1
and the following line at the end of stats.1.out:
0,0,2
So, although only outflows would be explicit (0 sends 1B to 1; 1 sends 2B to 0), the entire traffic information would be implicitly available, because those lines directly imply the inflows (0 receives 2B from 1; 1 receives 1B from 0).
I thus think there is no point in saving the bi-directional traffic: we can save either out- or inflow, in order to reduce I/O time.

@PhilMiller (Member):

In modeling or simulating how each object's communication contributes to the load it represents, I do think we need to record the object-level communication graph, not the contracted process-level graph that arises from it.

When we evaluate the hypothetical re-assignment of an object to a different processor, we need to consider the change in overall communication resulting from moving that object. So, we would at least need to know what processes each object's communication involved.

When we evaluate re-assignment of multiple objects, then the information of which processes a given object was communicating with is no longer informative, because some of the communication partners may also have moved. So, I conclude that we need full object-level communication records to be available to the load balancing strategies, and to any simulations thereof.

@ppebay (Contributor) commented Feb 25, 2019:

@PhilMiller Agreed.
Sorry for not having mentioned it here earlier, but Jonathan and I spoke about this over the weekend and, yes indeed, we need to keep track of the object-to-object communications. The expectation is that for real cases this will not result in a quadratic or worse increase in output file size.
Thank you.

@PhilMiller (Member):

Ah, good to hear. Indeed, for most reasonable applications, each object is only expected to interact with a small, often constant, number of other objects, so there would be only a small multiplicative factor increase in the footprint of the object graph data.

@ppebay (Contributor) commented Feb 25, 2019:

So, continuing with the object-to-object communication: what does the weight (scalar) associated with an edge represent? Is it a flow rate through the link, or rather the amount of data that must be transferred between the two end-points?

@lifflander (Collaborator, Author):

The weight (scalar) for object-to-object communication can be any normalized scalar, but I think that bytes makes the most sense because it's easily compatible with information we obtain about the network (if/when we add topology as a factor).

What do you think of the following format?

<iter/phase>, <object-id>, <time-in-seconds>, [<comm-object-id>, <comm-bytes>]...

Each line will have the object, the time, and then a series of objects it communicates with, which may be empty. Each line will be variable in length.

Or we could put this on separate lines, so there would be two different formats: if there are 4 elements on a line it would imply communication; otherwise it would be a computation line.

<iter/phase>, <object-id>, <time-in-seconds>
<iter/phase>, <object-id1>, <object-id2>, <num-bytes>

@ppebay (Contributor) commented Feb 26, 2019:

I have a preference for the latter because it is closer to what we could easily map into an ExodusII format via the VTK I/O facilities; in our case, 3 elements on a line imply a node (vertex), and 4 an edge (line). In fact, with just a few extra boilerplate lines in the file, it could be used as-is in the old-style ASCII VTK readers. So, I am voting for the latter. TY.

@lifflander (Collaborator, Author):

@ppebay Ok, I will implement it with separate lines; that also makes it easy to ignore communication edges if you choose to do so. I am about 80% done with the implementation on #283.

@lifflander lifflander force-pushed the feature-280-lb-sim-output-only branch from d2004a4 to ea1e42a Compare February 27, 2019 03:04
@lifflander lifflander merged commit 82fb410 into develop Feb 27, 2019