TAUoverMRNet 2.0-alpha build instructions:
==========================================
By: Chee Wai Lee ([email protected])
Last Updated: 11/8/2010
Introduction:
=============
This document describes the steps required to build the alpha,
re-engineered implementation of TAUoverMRNet (ToM) with TAU and a
beta (unreleased) copy of MRNet 3.0.
Please note: this feature set is still in alpha development and is
not intended or supported for production use.
Software Requirements:
======================
1. A pre-release copy of TAU 2.19.3 (or whichever later release
first officially supports ToM).
2. A copy of MRNet 3.0 (or a 3.0 beta). The current code base
supports the MRNet build dated 2010-06-30 (yyyy-mm-dd).
Other Requirements:
===================
1. ToM will only work with TAU-instrumented MPI codes.
Current Features:
=================
1. ToM monitoring operations are triggered by collective calls to
TAU_ONLINE_DUMP(), which the user inserts into the application code.
2. Each call to TAU_ONLINE_DUMP() invokes the following parallel
monitoring operations in sequence:
a. The mean, standard deviation, minimum and maximum values for
each event metric and counter combination are computed through
the reduction tree and deposited at the front-end. Currently, the
data is only written to disk, as a standard TAU profile file
named "mean.<frame>.0.0" that represents a mean profile for the
entire application across all threads.
b. A histogram of thread counts is generated for each event metric
and counter combination. This operation uses the information
generated by operation (a) and requires one reduction. The data is
deposited at the front-end and is currently only written to disk
as "tau.histograms.<frame>".
c. K-means clustering is applied to the performance data. Each data
point in the clustering algorithm represents a thread, and the
dimensions of the clustering space span all events encountered by
the application so far. One set of cluster results is generated
for each metric and counter combination. The user controls the
choice of K through the environment variable TOM_CLUSTER_K. The
number of reductions required depends on how quickly the algorithm
converges. The algorithm creates a directory named "cluster_<frame>"
for each frame. Inside a frame's directory, a standard TAU profile
named "profile.<k>.0.0" is generated for each of the K clusters;
the profile for <k> is the mean profile across all threads assigned
to cluster <k>.
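Operations (a) and (b) rely on statistics that can be combined piecewise as data moves up a reduction tree. The C++ sketch below assumes each tree node forwards a small mergeable summary (count, sum, sum of squares, minimum, maximum) for one event/metric pair, and that histogram bins span the minimum and maximum found by (a). The names Summary, leafHistogram and mergeHistogram, the use of the population standard deviation, and the bin count are all hypothetical; this is not ToM's actual filter code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical partial summary for one (event, metric) pair. It is
// mergeable, so each tree node can combine its children's summaries
// and forward a single summary upward.
struct Summary {
    std::size_t count = 0;
    double sum = 0.0;
    double sumSq = 0.0;
    double min = std::numeric_limits<double>::infinity();
    double max = -std::numeric_limits<double>::infinity();

    // Leaf: one thread contributes one value.
    static Summary fromValue(double v) {
        Summary s;
        s.count = 1; s.sum = v; s.sumSq = v * v; s.min = v; s.max = v;
        return s;
    }

    // Internal node: fold a child's summary into this one.
    void merge(const Summary& o) {
        count += o.count; sum += o.sum; sumSq += o.sumSq;
        min = std::min(min, o.min);
        max = std::max(max, o.max);
    }

    double mean() const { return count ? sum / count : 0.0; }

    // Population standard deviation (an assumption) from the sums.
    double stddev() const {
        if (count == 0) return 0.0;
        double m = mean();
        double var = sumSq / count - m * m;
        return var > 0.0 ? std::sqrt(var) : 0.0;  // clamp rounding noise
    }
};

// Second pass: min/max from the first reduction fix the bin edges, so
// each leaf bins its own value and the tree only has to add counts.
std::vector<std::size_t> leafHistogram(double v, double lo, double hi,
                                       std::size_t bins) {
    assert(bins > 0);
    std::vector<std::size_t> h(bins, 0);
    std::size_t b = 0;
    if (hi > lo && v > lo) {
        b = static_cast<std::size_t>((v - lo) / (hi - lo) * bins);
        if (b >= bins) b = bins - 1;  // v == hi lands in the last bin
    }
    h[b] += 1;
    return h;
}

// Internal node: element-wise sum of child histograms.
void mergeHistogram(std::vector<std::size_t>& into,
                    const std::vector<std::size_t>& from) {
    for (std::size_t i = 0; i < into.size(); ++i) into[i] += from[i];
}
```

For example, merging the summaries of four threads with values 1, 2, 3 and 4 in two subtrees yields mean 2.5, minimum 1 and maximum 4; with two bins over [1, 4], the merged histogram counts two threads per bin. The sum-of-squares form keeps each summary to five numbers but can lose precision when values are large and tightly clustered; merging (count, mean, M2) pairs in the style of Chan et al. is a more robust alternative.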
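Operation (c) can be pictured with the following single-process C++ sketch of Lloyd's k-means, assuming each thread is a point whose coordinates are its values for one metric across all events. Only TOM_CLUSTER_K comes from this document; the names Point, clusterKFromEnv and kmeans, the fallback K, the initial centroids and the squared Euclidean distance are hypothetical, and ToM's actual initialization and convergence test may differ. In a tree-based version, the per-cluster sums and counts are the quantities that would be reduced once per iteration, which is consistent with the number of reductions depending on convergence.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // one thread: one value per event

// Reads K from TOM_CLUSTER_K; the fallback value is hypothetical.
std::size_t clusterKFromEnv(std::size_t fallback) {
    const char* s = std::getenv("TOM_CLUSTER_K");
    if (!s) return fallback;
    char* end = nullptr;
    long k = std::strtol(s, &end, 10);
    return (end != s && k > 0) ? static_cast<std::size_t>(k) : fallback;
}

double sqDist(const Point& a, const Point& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Lloyd's k-means: alternate assignment and centroid update until no
// thread changes cluster (or maxIters is reached). Returns the cluster
// label of each point; centroids hold each cluster's mean profile.
std::vector<std::size_t> kmeans(const std::vector<Point>& pts,
                                std::vector<Point>& centroids,
                                int maxIters) {
    assert(!centroids.empty());
    const std::size_t k = centroids.size();
    const std::size_t dim = pts.empty() ? 0 : pts[0].size();
    std::vector<std::size_t> label(pts.size(), 0);
    for (int it = 0; it < maxIters; ++it) {
        bool changed = false;
        // Assignment step: each thread picks its nearest centroid.
        for (std::size_t p = 0; p < pts.size(); ++p) {
            std::size_t best = 0;
            double bestD = std::numeric_limits<double>::infinity();
            for (std::size_t c = 0; c < k; ++c) {
                double d = sqDist(pts[p], centroids[c]);
                if (d < bestD) { bestD = d; best = c; }
            }
            if (it == 0 || label[p] != best) changed = true;
            label[p] = best;
        }
        if (!changed) break;  // converged: labels are stable
        // Update step: per-cluster sums and counts are what a
        // reduction tree would combine.
        std::vector<Point> sum(k, Point(dim, 0.0));
        std::vector<std::size_t> cnt(k, 0);
        for (std::size_t p = 0; p < pts.size(); ++p) {
            for (std::size_t i = 0; i < dim; ++i) sum[label[p]][i] += pts[p][i];
            ++cnt[label[p]];
        }
        for (std::size_t c = 0; c < k; ++c)
            if (cnt[c] > 0)  // empty clusters keep their old centroid
                for (std::size_t i = 0; i < dim; ++i)
                    centroids[c][i] = sum[c][i] / cnt[c];
    }
    return label;
}
```

For example, four two-dimensional points forming two well-separated pairs, started from centroids {0, 0} and {10, 10}, end up with each pair in its own cluster, and each final centroid is the mean of its pair, analogous to the per-cluster mean profile written as "profile.<k>.0.0".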
Building ToM:
=============
1. Configure TAU. The following is the minimal required configuration;
PDT and PAPI support are also worth enabling:
cd <tau root directory>
./configure -mpi \
-mpiinc=<mpi include dir, if needed> \
-mpilib=<mpi lib dir, if needed> \
-mon-mrnet=<mrnet source root> \
-mon-mrnet-lib=<dir where mrnet libraries are installed>
2. Build TAU:
make install
3. Edit the ToM front-end build parameters:
cd tools/src/ToM
In the Makefile, set INSTALL_ROOT, CXX and CXXFLAGS.
4. Build ToM Front-End:
make install
5. Locate the appropriate start script in the scripts directory. These
scripts are generally named startToM_<platform>.sh. Copy the selected
script to the directory where the ToM experiments will be run.
6. Edit the ToM supporting tools' build parameters:
cd probeHosts
In the Makefile, set INSTALL_DIR and any settings required for MPI.
Installing these tools copies the files "probe" and "probeDiff"
into INSTALL_DIR.
7. Build supporting tools:
make install
8. Make sure INSTALL_ROOT/bin is in the PATH environment variable and
INSTALL_ROOT/lib is in the LD_LIBRARY_PATH environment variable.
9. Build your application with the ToM-supported build of TAU. Make
sure the environment variable TAU_MAKEFILE is correctly set.
Running a ToM-supported instrumented application:
=================================================
These instructions are platform-specific:
1. Cray XT5 platform:
a. In your job script, start the front-end and application using
the start script:
./startToM_craycnl.sh <profiledir> <num total cores> ToM_FE <num app cores> <mrnet fanout> <num mrnet cores> <application executable> [<app options>]
Example:
./startToM_craycnl.sh output_dir 24 ToM_FE 12 2 12 ./hello-ToM 21
b. On the Cray XT5, MRNet tree processes cannot share nodes with
application processes. As a result, the following two arguments
must respect node boundaries, so that each group of processes
occupies whole nodes:
i. <num app cores>
ii. <num mrnet cores>
In addition, <num total cores> = <num app cores> + <num mrnet cores>.
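The core-count rules above can be checked before a job is submitted. This hypothetical C++ helper assumes every node has the same number of cores (a value you supply, since it depends on the machine) and interprets "respecting node boundaries" as each count being a whole multiple of that number; CoreLayout, respectsNodeBoundaries and nodesNeeded are illustrative names, not part of ToM.

```cpp
#include <cassert>

// Hypothetical container for the three core-count arguments passed to
// startToM_craycnl.sh.
struct CoreLayout {
    int totalCores;
    int appCores;
    int mrnetCores;
};

// True when application and MRNet processes can occupy disjoint whole
// nodes and the total equals their sum. coresPerNode is an assumption
// supplied by the user, not a ToM parameter.
bool respectsNodeBoundaries(const CoreLayout& l, int coresPerNode) {
    if (coresPerNode <= 0 || l.appCores <= 0 || l.mrnetCores <= 0)
        return false;
    return l.appCores % coresPerNode == 0 &&
           l.mrnetCores % coresPerNode == 0 &&
           l.totalCores == l.appCores + l.mrnetCores;
}

// Number of whole nodes the job occupies (meaningful only when
// respectsNodeBoundaries() returns true).
int nodesNeeded(const CoreLayout& l, int coresPerNode) {
    return l.totalCores / coresPerNode;
}
```

With the example arguments above (24 total, 12 application and 12 MRNet cores) and 12-core nodes, the check passes and the job spans 2 nodes; with 8-core nodes it fails, because 12 cores cannot fill whole 8-core nodes.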
2. Linux Rocks clusters:
a. On Linux Rocks clusters, depending on the cluster setup, you may
need to use ssh-agent so that MRNet processes can communicate via
ssh without prompting for a password. For example, from the node
you are first allocated interactively:
ssh-agent bash
ssh-add
ssh-add will prompt for your key passphrase. This also means such a
setup cannot be used non-interactively.
b. As on the Cray platform, both the front-end and the back-ends can
be started with a single script:
./startToM_linux.sh <profiledir> <num total cores> ToM_FE <num app cores> <mrnet fanout> <num mrnet cores> <application executable> [<app options>]
Platform-specific Issues:
=========================
Known ToM Issues:
=================
1. ToM is still in alpha development. Many esoteric restrictions and
design decisions still need to be ironed out before production use.
This document will be updated with lists of known issues as
development proceeds toward a proper release.
2. Please contact [email protected] for any questions and comments.
Thank you.