-
Notifications
You must be signed in to change notification settings - Fork 865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial Software-based Performance Counters PR #4885
Conversation
Can one of the admins verify this patch? |
Reference paper for this PR. |
ok to test |
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/fc3f553cdcceef1058554ddce9862137 |
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/e0e2bd8f20c25081c15e358cdd70da84 |
@davideberius a bit is missing in here is a patch that should fix it commit df715db8ab2ac50e2851f700c59707ea200589d7
Author: Gilles Gouaillardet <[email protected]>
Date: Thu Mar 8 14:52:15 2018 +0900
examples: fix Makefile for spc_example
Signed-off-by: Gilles Gouaillardet <[email protected]>
diff --git a/examples/Makefile b/examples/Makefile
index f7d7687..e77e783 100644
--- a/examples/Makefile
+++ b/examples/Makefile
@@ -13,8 +13,8 @@
# Copyright (c) 2011-2016 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2012 Los Alamos National Security, Inc. All rights reserved.
# Copyright (c) 2013 Mellanox Technologies, Inc. All rights reserved.
-# Copyright (c) 2017 Research Organization for Information Science
-# and Technology (RIST). All rights reserved.
+# Copyright (c) 2017-2018 Research Organization for Information Science
+# and Technology (RIST). All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@@ -133,6 +133,8 @@ ring_c: ring_c.c
$(MPICC) $(CFLAGS) $(LDFLAGS) $? $(LDLIBS) -o $@
connectivity_c: connectivity_c.c
$(MPICC) $(CFLAGS) $(LDFLAGS) $? $(LDLIBS) -o $@
+spc_example: spc_example.c
+ $(MPICC) $(CFLAGS) $(LDFLAGS) $? $(LDLIBS) -o $@
hello_cxx: hello_cxx.cc
$(MPICXX) $(CXXFLAGS) $(LDFLAGS) $? $(LDLIBS) -o $@ |
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/c5876e6706ec5e36ae6922d74a9b71d2 |
Hmm, I wonder if it's something GNU specific in the code that breaks XLC?
|
Here is the verbose make at that point, in case it helps: shell$ make V=1
depbase=`echo pml_ob1_isend.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;\
/bin/sh ../../../../libtool --tag=CC --mode=compile xlc_r -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../ompi/include -I../../../../oshmem/include -I../../../../opal/mca/hwloc/hwloc2a/hwloc/include/private/autogen -I../../../../opal/mca/hwloc/hwloc2a/hwloc/include/hwloc/autogen -I../../../../ompi/mpiext/cuda/c -I../../../.. -I../../../../orte/include -I/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/mca/event/libevent2022/libevent -I/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/mca/event/libevent2022/libevent/include -I/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/mca/hwloc/hwloc2a/hwloc/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT pml_ob1_isend.lo -MD -MP -MF $depbase.Tpo -c -o pml_ob1_isend.lo pml_ob1_isend.c &&\
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: xlc_r -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../ompi/include -I../../../../oshmem/include -I../../../../opal/mca/hwloc/hwloc2a/hwloc/include/private/autogen -I../../../../opal/mca/hwloc/hwloc2a/hwloc/include/hwloc/autogen -I../../../../ompi/mpiext/cuda/c -I../../../.. -I../../../../orte/include -I/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/mca/event/libevent2022/libevent -I/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/mca/event/libevent2022/libevent/include -I/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/mca/hwloc/hwloc2a/hwloc/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT pml_ob1_isend.lo -MD -MP -MF .deps/pml_ob1_isend.Tpo -c pml_ob1_isend.c -fPIC -DPIC -o .libs/pml_ob1_isend.o
Calling signal handler...
1586-494 (U) INTERNAL COMPILER ERROR: Signal 11.
/opt/ibm/xlC/13.1.5/bin/.orig/xlc_r: error: 1501-230 Internal compiler error; please contact your Service Representative. For more information visit:
http://www.ibm.com/support/docview.wss?uid=swg21110810
make: *** [pml_ob1_isend.lo] Error 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, this looks pretty great. I have a bunch of nit-picky comments, and I also have some slightly larger comments.
Additionally, there should likely be some form of documentation with this. E.g., a man page or something.
examples/spc_example.c
Outdated
MPI_Comm_size(MPI_COMM_WORLD, &size); | ||
if(size != 2) { | ||
fprintf(stderr, "ERROR: This test should be run with two MPI processes.\n"); | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better to MPI_ABORT here.
examples/spc_example.c
Outdated
/* Make sure we found the counters */ | ||
if(index == -1) { | ||
fprintf(stderr, "ERROR: Couldn't find the appropriate SPC counter in the MPI_T pvars.\n"); | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: MPI_ABORT.
examples/spc_example.c
Outdated
/* Counter names to be read by ranks 0 and 1 */ | ||
char counter_names[2][40]; | ||
sprintf(counter_names[0], "runtime_spc_OMPI_BYTES_SENT_USER"); | ||
sprintf(counter_names[1], "runtime_spc_OMPI_BYTES_RECEIVED_USER"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be ok to use string constants here with C99 initialization (vs. using sprintf -- some compilers warn about using sprintf).
ompi/mca/pml/ob1/pml_ob1.c
Outdated
@@ -36,6 +36,7 @@ | |||
#include "opal_stdint.h" | |||
#include "opal/mca/btl/btl.h" | |||
#include "opal/mca/btl/base/base.h" | |||
#include "ompi/runtime/ompi_spc.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't appear to be used.
ompi/mca/pml/ob1/pml_ob1_recvfrag.c
Outdated
return remove_head_from_ordered_list(&proc->frags_cant_match); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this appears to be unrelated to SPC. Might want to exclude this change from this PR.
ompi/runtime/ompi_spc.c
Outdated
int i, j, rank, world_size, offset, err; | ||
long long *recv_buffer, *send_buffer; | ||
|
||
ompi_communicator_t *comm = &ompi_mpi_comm_world.comm; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may not be safe to use MPI_COMM_WORLD here. I think you might want to dup COMM_WORLD during your startup and then use that here in the dump (and then be sure to free that communicator when you're done with it).
ompi/runtime/ompi_spc.c
Outdated
} | ||
|
||
/* Aggregate all of the information on rank 0 using MPI_Gather on MPI_COMM_WORLD */ | ||
if(rank == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only difference between these two blocks is the allocation of recv_buffer
, right? Might want to just make that the only thing dependent upon rank == 0
.
ompi/runtime/ompi_spc.c
Outdated
|
||
/* Once rank 0 has all of the information, print the aggregated counter values for each rank in order */ | ||
if(rank == 0) { | ||
fprintf(stdout, "OMPI Software Counters:\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use fprintf
or opal_output
here?
Also, we use the name "Open MPI" publicly; we rarely use the name OMPI
in help/output messages.
ompi/runtime/ompi_spc.c
Outdated
fprintf(stdout, "OMPI Software Counters:\n"); | ||
offset = 0; /* Offset into the recv_buffer for each rank */ | ||
for(j = 0; j < world_size; j++) { | ||
fprintf(stdout, "World Rank %d:\n", j); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean "MPI_COMM_WORLD rank".
ompi/runtime/ompi_spc.c
Outdated
* This is denoted unlikely because the counters will often be turned off. | ||
*/ | ||
if(OPAL_UNLIKELY(attached_event[event_id] == 1 && *cycles == 0)) { | ||
*cycles = opal_timer_base_get_cycles(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above -- should this just use the high-precision timer.
Per discussion at face to face meeting on 03/20/18:
|
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/154b75812fe732a7906766782994adb0 |
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/9c0c0618be5243ba277701cda384f8ef |
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/065cc2e477d9417e7d6e1987ee5dc903 |
The XLC compiler seems puzzled by the existence of the following line of code "do {} while (0);". |
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/504e194c3b351871519716e8e85541b6 |
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/f6d621757ccb5ea57cfe94d7251b8f93 |
ompi/runtime/ompi_spc.h
Outdated
#else /* SPCs are not enabled */ | ||
|
||
#define SPC_INIT() \ | ||
do {} while (0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the problem with XL, why don't you just have it:
#define SPC_INIT() ;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was under the impression that some compilers do not like empty statements such as ";;".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hopefully this line is our problem.
We can test this hypothesis by putting something random there right?
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/fc7047f0071fd59b8deb07430b64b532 |
Seems like we have a winner! |
Hmm. There's quite a few of my comments that are hidden by github in "outdated" blocks, but they weren't discussed / addressed. E.g., why invent your own bitmap handling instead of using |
opal_bitmap_t is an overkill for our needs and in this context the emphasis is on performance. This answer also applies to your question about frequencies, we cannot afford to call an expensive timer as we are in the critical path of the MPI library (as an example in the matching). |
I just want to follow up on this PR. It has been a while. Right now the code is functional. We just have some suggestions from @jsquyres that @bosilca does not agree with. (Regarding the CPU frequency changing dynamically, etc) So I request @ggouaillardet to pitch in. If there is no discussion anymore, I suggest that we should merge this before it goes to the void. |
Seems like no response will be made. @jsquyres can we agree and merge this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still quite hesitant about the variable frequency issue. I can be out-voted here, but Intel chips can (and do) change frequency depending on load. For a well-behaving HPC application (i.e., something that runs the CPU at 100%), the frequency likely won't change. But that may not be true in the general case (i.e., for all the plebian HPC apps out there).
examples/spc_example.c
Outdated
char name[256], description[256]; | ||
|
||
/* Counter names to be read by ranks 0 and 1 */ | ||
char counter_names[2][40] = { "runtime_spc_OMPI_BYTES_SENT_USER", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably be char *counter_names[] = { "...
test/spc/spc_test.c
Outdated
char name[256], description[256]; | ||
|
||
/* Counter names to be read by ranks 0 and 1 */ | ||
char counter_names[2][40] = { "runtime_spc_OMPI_BYTES_SENT_USER", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as before: char *counter_names[] = ...
ompi/runtime/ompi_spc.h
Outdated
*/ | ||
|
||
/* This enumeration serves as event ids for the various events */ | ||
enum ompi_spc_counters_t { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion earlier was in shorthand. This should probably be:
typedef enum ompi_spc_counters { ... } ompi_spec_counters_t
I think this will make it like many other declarations in OMPI.
ompi/runtime/ompi_spc.h
Outdated
OMPI_OOS_IN_QUEUE, | ||
OMPI_MAX_UNEXPECTED_IN_QUEUE, | ||
OMPI_MAX_OOS_IN_QUEUE, | ||
OMPI_SPC_NUM_COUNTERS /* This serves as the number of counters. It must be last. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My prior comment actually intended you to put SPC_
in all of these enums. OMPI_SPC_NUM_COUNTERS
was a single example, but it also applies to OMPI_SPC_MAX_OOS_IN_QUEUE
, OMPI_SPC_MAX_UNEXPECTED_IN_QUEUE
, ...etc. I know this is a PITA, but this is a limitation of C -- we need to have decent prefixes to our values as a primitive namespace. Sorry. 😦
ompi/runtime/ompi_spc.c
Outdated
*/ | ||
static int ompi_spc_get_count(const struct mca_base_pvar_t *pvar, void *value, void *obj_handle) | ||
{ | ||
(void) obj_handle; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned twice, please use __opal_attribute_unused__
.
ompi/runtime/ompi_spc.c
Outdated
*counter_value /= sys_clock_freq_mhz; | ||
} | ||
/* If this is a high watermark counter, reset it after it has been read */ | ||
if(index == OMPI_MAX_UNEXPECTED_IN_QUEUE || index == OMPI_MAX_OOS_IN_QUEUE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my previous comment:
It might be better to have some kind of attribute somewhere that indicates that a particular SPC counter needs to be reset after being read (Vs. having a hard-coded list of specific enums here in the code).
} | ||
|
||
/* Aggregate all of the information on rank 0 using MPI_Gather on MPI_COMM_WORLD */ | ||
send_buffer = (long long*)malloc(OMPI_SPC_NUM_COUNTERS * sizeof(long long)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my previous comment, please check for malloc
failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be fixed this evening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this was already fixed, but this line didn't change so the comment still shows.
ompi/runtime/ompi_spc.h
Outdated
} ompi_spc_t; | ||
|
||
/* Events data structure initialization function */ | ||
void events_init(void); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function name needs to be prefixed with ompi_spc_
.
ompi/runtime/ompi_spc.c
Outdated
int i, j, world_size, offset; | ||
long long *recv_buffer = NULL, *send_buffer; | ||
|
||
ompi_communicator_t *comm = &ompi_mpi_comm_world.comm; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my prior comment:
It may not be safe to use MPI_COMM_WORLD here. I think you might want to dup COMM_WORLD during your startup and then use that here in the dump (and then be sure to free that communicator when you're done with it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll fix this this evening.
@jsquyres what we are measuring are very short time intervals (such as the matching time). The change for a clock frequency change during such measurements is minimal (frankly non-existent) so I would not be concerned about the lack of accuracy. Moreover, we never return to the user absolute timing, so even if there is a drift in frequency during the execution the impact on the value returned will be minimal. |
@bosilca I hear what you're saying. And I agree: it'll be pretty rare for the frequency to change while you're making the measurement. But:
Here's an example:
That can definitely happen. I'm also curious: what is the usefulness of having fast-but-inaccurate timings? If x% of your timings could be incorrect, how is that data useful? Maybe it would be enough to have 2 different methods of timing: one based on your current frequency method, and another based on slower/more-accurate methods (e.g.,
|
Right after I hit submit, I thought of a better example where the frequency can change: while blocking for disk I/O. E.g.: while (!done) {
MPI_Test(...);
if (flag) break;
read(fd, large_chunk_of_disk_data, ...);
} Given that the (this is somewhat moot because the frequency is only measured at the beginning of the app, but it's just better example than the |
The example you provided will work just fine with modern processors. For older processors the results might be off if there are frequency changes between MPI calls, but there I am not sure that people really care about measuring matching time. |
I checked to make sure that we are verifying that the timers are monotonic. The following code shows how this is done right now. Should I add an SPC specific warning message here?
|
We would like to get this into 4.0. The fork is this Friday IIRC. We should reach agreement on that timer/tick rate or actually remove it completely. The new matching is fast anyway. |
@jsquyres poke |
ompi/runtime/ompi_spc.c
Outdated
void ompi_spc_fini(void) | ||
{ | ||
#if SPC_ENABLE == 1 | ||
ompi_spc_dump(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like I forgot to add the check for whether dumping is enabled. I'll add this when I get home this evening.
Ok, I've made all of the changes requested. What is the status of the frequency issue? |
I don't think there is an issue anymore. We discussed it with Jeff privately, and I think we agreed that as long as there is a warning we are good to go. But I hope @jsquyres will confirm and then approve this PR so that we can merge. |
I just filed a quick PR (davideberius#1 -- I couldn't seem to push to @davideberius's PR branch for some reason) with some minor suggestions. In terms of the frequency issue, there's two points:
|
@davideberius highlighted that we have the same level of protection for timers as MPI_Wtime. And we inherit the same level of warning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great. Squash and go.
Ok, I'll squash this when I get back home tonight. Should this all be one commit? |
Technically there is no strict rule on the number of commits (the details are left to the PR author(s)). In this particular case, due to a significant number of updates, I would indeed squash down to 1. |
…ral counters. This code is the implementation of Software-base Performance Counters as described in the paper 'Using Software-Base Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI. All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations which was a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build. This defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize. Added an SPC test and example. Signed-off-by: David Eberius <[email protected]>
I have squashed all of the commits into a single commit. |
bot:mellanox:retest Looks like the Jenkins/Github git clone bug again. 😦 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bosilca Ok, fair enough that we're in the same category as MPI_WTICK/MPI_WTIME, which means that my concerns are not specific to this PR (i.e., we shouldn't hold up this PR any further, and should instead address that issue outside of this PR). I would still like to see some kind of disable-able warning about this ("your time may not be constant!"), but that will be outside of this PR.
More than what @davideberius pinpointed to few comments above ?
|
@bosilca Yes, more than that (but perhaps similar?) -- Dumb question: is the OMPI time-invariant clock-reading-code getting its values from a core or from a package? (I assume core, but I don't know that for a fact) I.e., if a process is bound to a socket/package, is it still going to always get consistent time? |
My understanding is that it depends on the processor, but I have no idea how to check what processor has what type of monotonic hwclock. I agree that from a user perspective it could be nice to be extremely attentive, but realistically it's a lot of work for little benefit. |
Added Software-based Performance Counters driver code along with several counters.