Initial Software-based Performance Counters PR #4885

Merged
merged 1 commit into from
Jun 12, 2018

davideberius
Contributor

@davideberius davideberius commented Mar 7, 2018

Added Software-based Performance Counters driver code along with several counters.

This code implements Software-Based Performance Counters (SPCs) as described in the paper 'Using Software-Based Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI.

All software event functions are wrapped in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations, which were a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build; this defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning the dumping of SPC counters in MPI_Finalize on and off. Added an SPC test and example.
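
As a rough illustration of the cycle-based timing described above, the sketch below accumulates raw cycle counts on the hot path and divides by the clock frequency only when a counter is read. The names `sketch_timer_t`, `sketch_timer_add`, and `sketch_timer_read_usec` are hypothetical and not taken from the PR, and the frequency is simply passed in as an assumed input.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accumulator: stores raw cycle counts only. */
typedef struct {
    uint64_t cycles;   /* total cycles accumulated so far */
} sketch_timer_t;

/* Hot path: one subtraction and one addition, no division. */
static inline void sketch_timer_add(sketch_timer_t *t, uint64_t start, uint64_t stop)
{
    assert(stop >= start);
    t->cycles += stop - start;
}

/* Cold path: convert to microseconds only when the counter is read.
 * freq_mhz is cycles per microsecond; it is an assumed input here. */
static inline uint64_t sketch_timer_read_usec(const sketch_timer_t *t, uint64_t freq_mhz)
{
    assert(freq_mhz > 0);
    return t->cycles / freq_mhz;
}
```

For example, adding the intervals 1000..3000 and 5000..6000 leaves 3000 cycles in the accumulator, which reads back as 3 microseconds at an assumed 1000 MHz.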

@ompiteam-bot

Can one of the admins verify this patch?

@thananon
Member

thananon commented Mar 7, 2018

Reference paper for this PR.

@jsquyres
Member

jsquyres commented Mar 7, 2018

ok to test

@ibm-ompi

ibm-ompi commented Mar 7, 2018

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/fc3f553cdcceef1058554ddce9862137

@ibm-ompi

ibm-ompi commented Mar 7, 2018

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/e0e2bd8f20c25081c15e358cdd70da84

@ggouaillardet
Contributor

@davideberius a bit is missing in examples/Makefile.

Here is a patch that should fix it:

commit df715db8ab2ac50e2851f700c59707ea200589d7
Author: Gilles Gouaillardet <[email protected]>
Date:   Thu Mar 8 14:52:15 2018 +0900

    examples: fix Makefile for spc_example
    
    Signed-off-by: Gilles Gouaillardet <[email protected]>

diff --git a/examples/Makefile b/examples/Makefile
index f7d7687..e77e783 100644
--- a/examples/Makefile
+++ b/examples/Makefile
@@ -13,8 +13,8 @@
 # Copyright (c) 2011-2016 Cisco Systems, Inc.  All rights reserved.
 # Copyright (c) 2012      Los Alamos National Security, Inc.  All rights reserved.
 # Copyright (c) 2013      Mellanox Technologies, Inc.  All rights reserved.
-# Copyright (c) 2017      Research Organization for Information Science
-#                         and Technology (RIST). All rights reserved.
+# Copyright (c) 2017-2018 Research Organization for Information Science
+#                         and Technology (RIST).  All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -133,6 +133,8 @@ ring_c: ring_c.c
        $(MPICC) $(CFLAGS) $(LDFLAGS) $? $(LDLIBS) -o $@
 connectivity_c: connectivity_c.c
        $(MPICC) $(CFLAGS) $(LDFLAGS) $? $(LDLIBS) -o $@
+spc_example: spc_example.c
+       $(MPICC) $(CFLAGS) $(LDFLAGS) $? $(LDLIBS) -o $@
 
 hello_cxx: hello_cxx.cc
        $(MPICXX) $(CXXFLAGS) $(LDFLAGS) $? $(LDLIBS) -o $@

@ibm-ompi

ibm-ompi commented Mar 8, 2018

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/c5876e6706ec5e36ae6922d74a9b71d2

@thananon
Member

thananon commented Mar 8, 2018

Hmm, I wonder if it's something GNU specific in the code that breaks XLC?

  CC       pml_ob1_isend.lo
  CC       pml_ob1_rdma.lo
  CC       pml_ob1_rdmafrag.lo
  CC       pml_ob1_recvfrag.lo
  CC       pml_ob1_recvreq.lo
  CC       pml_ob1_sendreq.lo
  CC       pml_ob1_start.lo
1586-494 (U) INTERNAL COMPILER ERROR: Signal 11.
Calling signal handler...
/opt/ibm/xlC/13.1.5/bin/.orig/xlc_r: error: 1501-230 Internal compiler error; please contact your Service Representative. For more information visit:
http://www.ibm.com/support/docview.wss?uid=swg21110810
make[2]: *** [pml_ob1_isend.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory `/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/ompi/mca/pml/ob1'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/ompi'
make: *** [all-recursive] Error 1

@jjhursey
Member

jjhursey commented Mar 8, 2018

Here is the verbose make at that point, in case it helps:

shell$ make V=1
depbase=`echo pml_ob1_isend.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;\
/bin/sh ../../../../libtool  --tag=CC   --mode=compile xlc_r -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../ompi/include -I../../../../oshmem/include -I../../../../opal/mca/hwloc/hwloc2a/hwloc/include/private/autogen -I../../../../opal/mca/hwloc/hwloc2a/hwloc/include/hwloc/autogen -I../../../../ompi/mpiext/cuda/c   -I../../../.. -I../../../../orte/include -I/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/mca/event/libevent2022/libevent -I/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/mca/event/libevent2022/libevent/include -I/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/mca/hwloc/hwloc2a/hwloc/include    -O3 -DNDEBUG -finline-functions -fno-strict-aliasing  -MT pml_ob1_isend.lo -MD -MP -MF $depbase.Tpo -c -o pml_ob1_isend.lo pml_ob1_isend.c &&\
mv -f $depbase.Tpo $depbase.Plo
libtool: compile:  xlc_r -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../ompi/include -I../../../../oshmem/include -I../../../../opal/mca/hwloc/hwloc2a/hwloc/include/private/autogen -I../../../../opal/mca/hwloc/hwloc2a/hwloc/include/hwloc/autogen -I../../../../ompi/mpiext/cuda/c -I../../../.. -I../../../../orte/include -I/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/mca/event/libevent2022/libevent -I/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/mca/event/libevent2022/libevent/include -I/home/mpiczar/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/mca/hwloc/hwloc2a/hwloc/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT pml_ob1_isend.lo -MD -MP -MF .deps/pml_ob1_isend.Tpo -c pml_ob1_isend.c  -fPIC -DPIC -o .libs/pml_ob1_isend.o
Calling signal handler...
1586-494 (U) INTERNAL COMPILER ERROR: Signal 11.
/opt/ibm/xlC/13.1.5/bin/.orig/xlc_r: error: 1501-230 Internal compiler error; please contact your Service Representative. For more information visit:
http://www.ibm.com/support/docview.wss?uid=swg21110810
make: *** [pml_ob1_isend.lo] Error 1

Member

@jsquyres jsquyres left a comment

In general, this looks pretty great. I have a bunch of nit-picky comments, and I also have some slightly larger comments.

Additionally, there should likely be some form of documentation with this. E.g., a man page or something.

MPI_Comm_size(MPI_COMM_WORLD, &size);
if(size != 2) {
fprintf(stderr, "ERROR: This test should be run with two MPI processes.\n");
return -1;
Member

Maybe better to MPI_ABORT here.

/* Make sure we found the counters */
if(index == -1) {
fprintf(stderr, "ERROR: Couldn't find the appropriate SPC counter in the MPI_T pvars.\n");
return -1;
Member

Same here: MPI_ABORT.

/* Counter names to be read by ranks 0 and 1 */
char counter_names[2][40];
sprintf(counter_names[0], "runtime_spc_OMPI_BYTES_SENT_USER");
sprintf(counter_names[1], "runtime_spc_OMPI_BYTES_RECEIVED_USER");
Member

Would be ok to use string constants here with C99 initialization (vs. using sprintf -- some compilers warn about using sprintf).
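
A minimal sketch of what this suggestion could look like, using an array of string constants instead of `sprintf` into fixed-width buffers. The two counter names come from the snippet under review; `sketch_find_counter_name` and `sketch_num_counter_names` are hypothetical helpers added only to make the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counter names to be read by ranks 0 and 1: no sprintf, no fixed-width buffers. */
static const char *const counter_names[] = {
    "runtime_spc_OMPI_BYTES_SENT_USER",
    "runtime_spc_OMPI_BYTES_RECEIVED_USER",
};

/* Number of entries, derived from the initializer rather than hard-coded. */
static const size_t sketch_num_counter_names =
    sizeof(counter_names) / sizeof(counter_names[0]);

/* Hypothetical helper: return the index of `name`, or -1 if it is not listed. */
static int sketch_find_counter_name(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sketch_num_counter_names; i++) {
        if (0 == strcmp(counter_names[i], name)) {
            return (int) i;
        }
    }
    return -1;
}
```

Because the array size follows from the initializer, adding a third name needs no other change.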

@@ -36,6 +36,7 @@
#include "opal_stdint.h"
#include "opal/mca/btl/btl.h"
#include "opal/mca/btl/base/base.h"
#include "ompi/runtime/ompi_spc.h"
Member

This doesn't appear to be used.

return remove_head_from_ordered_list(&proc->frags_cant_match);
}
Member

Nit: this appears to be unrelated to SPC. Might want to exclude this change from this PR.

int i, j, rank, world_size, offset, err;
long long *recv_buffer, *send_buffer;

ompi_communicator_t *comm = &ompi_mpi_comm_world.comm;
Member

It may not be safe to use MPI_COMM_WORLD here. I think you might want to dup COMM_WORLD during your startup and then use that here in the dump (and then be sure to free that communicator when you're done with it).

}

/* Aggregate all of the information on rank 0 using MPI_Gather on MPI_COMM_WORLD */
if(rank == 0) {
Member

The only difference between these two blocks is the allocation of recv_buffer, right? Might want to just make that the only thing dependent upon rank == 0.


/* Once rank 0 has all of the information, print the aggregated counter values for each rank in order */
if(rank == 0) {
fprintf(stdout, "OMPI Software Counters:\n");
Member

Should we use fprintf or opal_output here?

Also, we use the name "Open MPI" publicly; we rarely use the name OMPI in help/output messages.

fprintf(stdout, "OMPI Software Counters:\n");
offset = 0; /* Offset into the recv_buffer for each rank */
for(j = 0; j < world_size; j++) {
fprintf(stdout, "World Rank %d:\n", j);
Member

I think you mean "MPI_COMM_WORLD rank".

* This is denoted unlikely because the counters will often be turned off.
*/
if(OPAL_UNLIKELY(attached_event[event_id] == 1 && *cycles == 0)) {
*cycles = opal_timer_base_get_cycles();
Member

See above -- should this just use the high-precision timer?

@thananon
Member

Per discussion at face to face meeting on 03/20/18:

  • @davideberius will have to follow up with the reviews.
  • SPC is bound for 4.0.0 release. No need to PR over to 3.0.x or 3.1.x.
  • Have to get in touch with the compiler guy from IBM to see what's wrong with XLC.
  • "counter_names" and "counter_descriptions" might not be good names for symbols due to possible name collisions. We might have to raise this problem with the PAPI team.

@ibm-ompi

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/154b75812fe732a7906766782994adb0

@ibm-ompi

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/9c0c0618be5243ba277701cda384f8ef

@ibm-ompi

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/065cc2e477d9417e7d6e1987ee5dc903

@bosilca
Member

bosilca commented Mar 21, 2018

The XLC compiler seems puzzled by the existence of the following line of code "do {} while (0);".

@ibm-ompi

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/504e194c3b351871519716e8e85541b6

@ibm-ompi

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/f6d621757ccb5ea57cfe94d7251b8f93

#else /* SPCs are not enabled */

#define SPC_INIT() \
do {} while (0)
Member

If this is the problem with XL, why don't you just have it:

#define SPC_INIT() ;

Member

I was under the impression that some compilers do not like empty statements such as ";;".

Member

Hopefully this line is our problem.
We can test this hypothesis by putting something random there, right?
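
One way to test the hypothesis in this thread is to swap the empty `do {} while (0)` body for another no-op that is still a single statement. The sketch below uses `((void) 0)`; this is only an illustration of the trade-off being discussed, not the change that was actually applied in the PR, and `sketch_start` and `sketch_spc_init_calls` are hypothetical names.

```c
#include <assert.h>

/* Assumption for this sketch: SPCs are compiled out. */
#define SPC_ENABLE 0

/* Counts how many times the enabled variant would have run. */
static int sketch_spc_init_calls = 0;

#if SPC_ENABLE == 1
#define SPC_INIT() do { sketch_spc_init_calls++; } while (0)
#else
/* A void expression becomes a complete statement once the caller adds ';'.
 * It avoids the empty do/while body and does not produce a stray ";;". */
#define SPC_INIT() ((void) 0)
#endif

/* The no-op must remain a single statement so an unbraced if/else still
 * parses; defining SPC_INIT() as a bare ';' would break the else below. */
static int sketch_start(int want_spc)
{
    assert(0 == want_spc || 1 == want_spc);
    if (want_spc)
        SPC_INIT();
    else
        return 0;
    return 1;
}
```

With `SPC_ENABLE` set to 0, calling `sketch_start(1)` takes the macro branch and leaves `sketch_spc_init_calls` untouched.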

@ibm-ompi

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/fc7047f0071fd59b8deb07430b64b532

@thananon
Member

Seems like we have a winner!

@jsquyres
Member

Hmm. There are quite a few of my comments that are hidden by GitHub in "outdated" blocks, but they weren't discussed / addressed. E.g., why invent your own bitmap handling instead of using opal_bitmap_t?

@bosilca
Member

bosilca commented Mar 23, 2018

opal_bitmap_t is overkill for our needs, and in this context the emphasis is on performance. This answer also applies to your question about frequencies: we cannot afford to call an expensive timer, as we are in the critical path of the MPI library (for example, in the matching).
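
To make the performance argument concrete, here is a minimal sketch of the kind of fixed-size, inline bitmap that a hot path might prefer over a general-purpose container. It is not the PR's implementation: `SKETCH_MAX_EVENTS`, `sketch_attach`, `sketch_detach`, and `sketch_is_attached` are hypothetical names, and the bound of 128 events is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_EVENTS 128   /* hypothetical upper bound on event ids */

/* One bit per event, packed into 32-bit words; zero-initialized. */
static uint32_t sketch_attached[SKETCH_MAX_EVENTS / 32];

/* Mark an event as attached. */
static inline void sketch_attach(unsigned id)
{
    assert(id < SKETCH_MAX_EVENTS);
    sketch_attached[id / 32] |= UINT32_C(1) << (id % 32);
}

/* Mark an event as detached. */
static inline void sketch_detach(unsigned id)
{
    assert(id < SKETCH_MAX_EVENTS);
    sketch_attached[id / 32] &= ~(UINT32_C(1) << (id % 32));
}

/* Hot-path test: one shift and one mask, with no function-call overhead
 * once inlined. */
static inline int sketch_is_attached(unsigned id)
{
    assert(id < SKETCH_MAX_EVENTS);
    return (int) ((sketch_attached[id / 32] >> (id % 32)) & 1u);
}
```

The design trade-off is the usual one: a fixed bound and no resizing in exchange for a check that compiles down to a few instructions.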

@thananon
Member

I just want to follow up on this PR. It has been a while.

Right now the code is functional. We just have some suggestions from @jsquyres that @bosilca does not agree with (regarding the CPU frequency changing dynamically, etc.), so I'd like to ask @ggouaillardet to pitch in.

If there is no further discussion, I suggest we merge this before it goes into the void.

@thananon
Member

thananon commented May 8, 2018

Seems like no response will be made. @jsquyres can we agree and merge this?

Member

@jsquyres jsquyres left a comment

I'm still quite hesitant about the variable frequency issue. I can be out-voted here, but Intel chips can (and do) change frequency depending on load. For a well-behaving HPC application (i.e., something that runs the CPU at 100%), the frequency likely won't change. But that may not be true in the general case (i.e., for all the plebeian HPC apps out there).

char name[256], description[256];

/* Counter names to be read by ranks 0 and 1 */
char counter_names[2][40] = { "runtime_spc_OMPI_BYTES_SENT_USER",
Member

Should probably be char *counter_names[] = { "...

char name[256], description[256];

/* Counter names to be read by ranks 0 and 1 */
char counter_names[2][40] = { "runtime_spc_OMPI_BYTES_SENT_USER",
Member

Same as before: char *counter_names[] = ...

*/

/* This enumeration serves as event ids for the various events */
enum ompi_spc_counters_t {
Member

My suggestion earlier was in shorthand. This should probably be:

typedef enum ompi_spc_counters { ... } ompi_spc_counters_t;

I think this will make it like many other declarations in OMPI.

OMPI_OOS_IN_QUEUE,
OMPI_MAX_UNEXPECTED_IN_QUEUE,
OMPI_MAX_OOS_IN_QUEUE,
OMPI_SPC_NUM_COUNTERS /* This serves as the number of counters. It must be last. */
Member

My prior comment actually intended you to put SPC_ in all of these enums. OMPI_SPC_NUM_COUNTERS was a single example, but it also applies to OMPI_SPC_MAX_OOS_IN_QUEUE, OMPI_SPC_MAX_UNEXPECTED_IN_QUEUE, ...etc. I know this is a PITA, but this is a limitation of C -- we need to have decent prefixes to our values as a primitive namespace. Sorry. 😦
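
The two naming suggestions in this review (a `typedef enum` and an `SPC_` prefix on every value) could be combined roughly as follows. Only the three `OMPI_SPC_MAX_*` / `OMPI_SPC_NUM_COUNTERS` names appear in the comments; `OMPI_SPC_OOS_IN_QUEUE` is extrapolated from the same pattern, and `sketch_spc_values` and `sketch_spc_record` are hypothetical.

```c
#include <assert.h>

/* This enumeration serves as event ids for the various events. */
typedef enum ompi_spc_counters {
    OMPI_SPC_OOS_IN_QUEUE,
    OMPI_SPC_MAX_UNEXPECTED_IN_QUEUE,
    OMPI_SPC_MAX_OOS_IN_QUEUE,
    OMPI_SPC_NUM_COUNTERS /* This serves as the number of counters. It must be last. */
} ompi_spc_counters_t;

/* The trailing sentinel sizes any per-counter array automatically. */
static long long sketch_spc_values[OMPI_SPC_NUM_COUNTERS];

/* Hypothetical update helper that takes the typedef'd enum as its id type. */
static void sketch_spc_record(ompi_spc_counters_t id, long long amount)
{
    assert((unsigned) id < (unsigned) OMPI_SPC_NUM_COUNTERS);
    sketch_spc_values[id] += amount;
}
```

Keeping the sentinel last means a newly added counter grows every array sized by `OMPI_SPC_NUM_COUNTERS` without further edits.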

*/
static int ompi_spc_get_count(const struct mca_base_pvar_t *pvar, void *value, void *obj_handle)
{
(void) obj_handle;
Member

As I mentioned twice, please use __opal_attribute_unused__.
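
For readers unfamiliar with the idiom, the sketch below shows the general shape of an "unused parameter" attribute replacing a `(void) obj_handle;` statement. `SKETCH_ATTRIBUTE_UNUSED` is a stand-in defined here for illustration; how `__opal_attribute_unused__` is actually defined inside Open MPI is not shown in this thread, and `sketch_get_count` is a hypothetical, simplified getter.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for __opal_attribute_unused__; assumed here to expand to the
 * GCC/Clang attribute and to nothing on other compilers. */
#if defined(__GNUC__)
#define SKETCH_ATTRIBUTE_UNUSED __attribute__((__unused__))
#else
#define SKETCH_ATTRIBUTE_UNUSED
#endif

/* Hypothetical getter: the third parameter exists only to satisfy a
 * callback signature, so it is annotated rather than cast to void. */
static int sketch_get_count(const long long *counter, void *value,
                            void *obj_handle SKETCH_ATTRIBUTE_UNUSED)
{
    assert(counter != NULL && value != NULL);
    *(long long *) value = *counter;
    return 0;
}
```

The annotation documents intent at the declaration, which is where a reader of the callback signature will look first.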

*counter_value /= sys_clock_freq_mhz;
}
/* If this is a high watermark counter, reset it after it has been read */
if(index == OMPI_MAX_UNEXPECTED_IN_QUEUE || index == OMPI_MAX_OOS_IN_QUEUE) {
Member

Per my previous comment:

It might be better to have some kind of attribute somewhere that indicates that a particular SPC counter needs to be reset after being read (Vs. having a hard-coded list of specific enums here in the code).
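
A minimal sketch of the attribute-based approach suggested here: each counter carries a flag saying whether it resets on read, so the read path never names specific enums. All identifiers prefixed with `sketch_` or `SKETCH_` are hypothetical and not taken from the PR.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counter ids for this sketch only. */
enum {
    SKETCH_SPC_BYTES_SENT,
    SKETCH_SPC_MAX_UNEXPECTED_IN_QUEUE,
    SKETCH_SPC_NUM_COUNTERS
};

typedef struct {
    long long value;
    bool reset_on_read;   /* true for high watermark counters */
} sketch_spc_t;

/* The per-counter attribute lives next to the counter itself. */
static sketch_spc_t sketch_counters[SKETCH_SPC_NUM_COUNTERS] = {
    [SKETCH_SPC_BYTES_SENT]              = { 0, false },
    [SKETCH_SPC_MAX_UNEXPECTED_IN_QUEUE] = { 0, true  },
};

/* Read a counter; the reset decision comes from the table, not from a
 * hard-coded list of enum values. */
static long long sketch_spc_read(int index)
{
    assert(index >= 0 && index < SKETCH_SPC_NUM_COUNTERS);
    long long value = sketch_counters[index].value;
    if (sketch_counters[index].reset_on_read) {
        sketch_counters[index].value = 0;
    }
    return value;
}
```

Adding another high watermark counter then means adding one table entry with `reset_on_read` set, with no change to the read function.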

}

/* Aggregate all of the information on rank 0 using MPI_Gather on MPI_COMM_WORLD */
send_buffer = (long long*)malloc(OMPI_SPC_NUM_COUNTERS * sizeof(long long));
Member

Per my previous comment, please check for malloc failure.

Contributor Author

This will be fixed this evening.

Contributor Author

Actually, this was already fixed, but this line didn't change so the comment still shows.
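
For reference, the kind of malloc check requested in this thread might look like the sketch below. The return codes and the function name `sketch_alloc_send_buffer` are hypothetical stand-ins, not the constants or code used in the PR.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical return codes for this sketch. */
#define SKETCH_SUCCESS              0
#define SKETCH_ERR_OUT_OF_RESOURCE (-1)

/* Allocate a buffer of num_counters values and report failure to the
 * caller instead of dereferencing a NULL pointer later. */
static int sketch_alloc_send_buffer(size_t num_counters, long long **out)
{
    assert(out != NULL && num_counters > 0);
    *out = (long long *) malloc(num_counters * sizeof(long long));
    if (NULL == *out) {
        return SKETCH_ERR_OUT_OF_RESOURCE;
    }
    return SKETCH_SUCCESS;
}
```

The caller owns the buffer on success and is expected to `free()` it after the gather completes.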

} ompi_spc_t;

/* Events data structure initialization function */
void events_init(void);
Member

This function name needs to be prefixed with ompi_spc_.

int i, j, world_size, offset;
long long *recv_buffer = NULL, *send_buffer;

ompi_communicator_t *comm = &ompi_mpi_comm_world.comm;
Member

Per my prior comment:

It may not be safe to use MPI_COMM_WORLD here. I think you might want to dup COMM_WORLD during your startup and then use that here in the dump (and then be sure to free that communicator when you're done with it).

Contributor Author

I'll fix this this evening.

@bosilca
Member

bosilca commented May 9, 2018

@jsquyres what we are measuring are very short time intervals (such as the matching time). The chance of a clock frequency change during such measurements is minimal (frankly non-existent), so I would not be concerned about the lack of accuracy. Moreover, we never return absolute timing to the user, so even if there is a drift in frequency during the execution, the impact on the value returned will be minimal.

@jsquyres
Member

@bosilca I hear what you're saying. And I agree: it'll be pretty rare for the frequency to change while you're making the measurement. But:

  1. It can still happen (e.g., if an app idles for a while before a message is received, e.g., while (!done) { MPI_Test(...); if (flag) break; sleep(1); }. Yes, we've all seen apps that do this. 😉 ).
  2. The frequency is measured at the beginning of the application -- not during each measurement. The chance of the frequency being different at the beginning of the application vs. the time of measurement is actually not small.

Here's an example:

  • An always-full HPC cluster finishes one job.
  • The job scheduler does all of its cleanup stuff at the end of the job, and resets the nodes for the next job.
  • During this time -- which may take a minute or three -- all but one of the cores spin down to a lower frequency.
  • The job scheduler finally gets the next job and starts it up, one process per core.
  • Each process calls MPI_INIT right away, which samples the frequency.
  • The process on the first core may well get a high frequency (because it's where the job scheduler was running and has been busy the whole time). But all the other cores may be at a lower frequency because they were more-or-less idle for a little while.
  • After a short while running the MPI apps, all the cores are back up to their full frequency, but the damage was already done: they measured their frequency during MPI_INIT as X, but now they're running at frequency Y (where X<Y).

That can definitely happen.
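
The arithmetic behind this scenario can be sketched as follows, under the assumption (made in this comment, and questioned later in the thread) that the cycle counter ticks at the current core frequency. The function names and the 1200/2400 MHz figures are hypothetical and chosen only to make the numbers easy to follow.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to microseconds at a given frequency in MHz. */
static uint64_t sketch_cycles_to_usec(uint64_t cycles, uint64_t freq_mhz)
{
    assert(freq_mhz > 0);
    return cycles / freq_mhz;
}

/* Relative error, in percent, of a time computed with the frequency X
 * sampled at startup when the cycles were really counted at frequency Y. */
static int64_t sketch_timing_error_percent(uint64_t cycles,
                                           uint64_t sampled_mhz,
                                           uint64_t actual_mhz)
{
    int64_t reported = (int64_t) sketch_cycles_to_usec(cycles, sampled_mhz);
    int64_t actual   = (int64_t) sketch_cycles_to_usec(cycles, actual_mhz);
    assert(actual > 0);
    return (reported - actual) * 100 / actual;
}
```

With X = 1200 MHz sampled during MPI_INIT and Y = 2400 MHz while running, 240000 cycles are really 100 microseconds but would be reported as 200, an error of 100 percent.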

I'm also curious: what is the usefulness of having fast-but-inaccurate timings? If x% of your timings could be incorrect, how is that data useful?

Maybe it would be enough to have 2 different methods of timing: one based on your current frequency method, and another based on slower/more-accurate methods (e.g., clock_gettime()), and have an MCA param to switch between the two. There are tradeoffs to both methods, of course:

  • Fast but possibly inaccurate.
  • Accurate but slow, which possibly affects the overall runtime (because it's adding "slow" clock checks in performance-critical code paths).
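
A minimal sketch of the two-method idea, assuming a POSIX system: a function pointer is set once at startup and every measurement goes through it. The fast path here is a fake tick counter standing in for a hardware cycle read, and all `sketch_` names, as well as the idea of driving the choice from an MCA parameter value, are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() */
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef enum { SKETCH_TIMER_FAST, SKETCH_TIMER_ACCURATE } sketch_timer_mode_t;

/* Stand-in for a cheap cycle-counter read; a real build would read hardware. */
static uint64_t sketch_fake_cycles = 0;

static uint64_t sketch_now_fast(void)
{
    return ++sketch_fake_cycles;
}

/* Slower but monotonic: nanoseconds from the POSIX monotonic clock. */
static uint64_t sketch_now_accurate(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(0 == rc);
    (void) rc;
    return (uint64_t) ts.tv_sec * UINT64_C(1000000000) + (uint64_t) ts.tv_nsec;
}

/* Chosen once at startup, e.g. from a (hypothetical) parameter value. */
static uint64_t (*sketch_timer_now)(void) = sketch_now_fast;

static void sketch_timer_select(sketch_timer_mode_t mode)
{
    sketch_timer_now = (SKETCH_TIMER_ACCURATE == mode) ? sketch_now_accurate
                                                       : sketch_now_fast;
}
```

Note that the two sources return different units (ticks versus nanoseconds), so any conversion to user-visible time would have to depend on the selected mode as well.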

@jsquyres
Member

Right after I hit submit, I thought of a better example where the frequency can change: while blocking for disk I/O. E.g.:

while (!done) {
    MPI_Test(...);
    if (flag) break;
    read(fd, large_chunk_of_disk_data, ...);
}

Given that the read() is blocking, it could actually swap out for a little while (especially if it's waiting on a device or the network), and could actually affect the core frequency.

(This is somewhat moot because the frequency is only measured at the beginning of the app, but it's just a better example than the sleep() example I listed above.)

@bosilca
Member

bosilca commented May 14, 2018

@jsquyres :

  1. if the cost of taking the measure significantly impacts the process itself, it is not worth measuring.
  2. we never measure things across MPI calls (it is important because)
  3. current processors do not exhibit the problem you are describing anymore.

The example you provided will work just fine with modern processors. For older processors the results might be off if there are frequency changes between MPI calls, but there I am not sure that people really care about measuring matching time.

@davideberius
Contributor Author

davideberius commented Jun 6, 2018

I checked to make sure that we are verifying that the timers are monotonic. The following code shows how this is done right now. Should I add an SPC specific warning message here?

int opal_timer_linux_open(void)
{
    int ret = OPAL_SUCCESS;

    if (mca_timer_base_monotonic && !opal_sys_timer_is_monotonic ()) {
#if OPAL_HAVE_CLOCK_GETTIME && (0 == OPAL_TIMER_MONOTONIC)
        struct timespec res;
        if( 0 == clock_getres(CLOCK_MONOTONIC, &res)) {
            opal_timer_linux_freq = 1.e3;
            opal_timer_base_get_cycles = opal_timer_linux_get_cycles_clock_gettime;
            opal_timer_base_get_usec = opal_timer_linux_get_usec_clock_gettime;
            return ret;
        }
#else
        /* Monotonic time requested but cannot be found. Complain! */
        opal_show_help("help-opal-timer-linux.txt", "monotonic not supported", true);
#endif  /* OPAL_HAVE_CLOCK_GETTIME && (0 == OPAL_TIMER_MONOTONIC) */
    }
    ret = opal_timer_linux_find_freq();
    opal_timer_base_get_cycles = opal_timer_linux_get_cycles_sys_timer;
    opal_timer_base_get_usec = opal_timer_linux_get_usec_sys_timer;
    return ret;
}

@thananon
Member

thananon commented Jun 6, 2018

We would like to get this into 4.0. The fork is this Friday IIRC. We should reach agreement on that timer/tick rate or actually remove it completely. The new matching is fast anyway.

@bosilca
Member

bosilca commented Jun 6, 2018

@jsquyres poke

void ompi_spc_fini(void)
{
#if SPC_ENABLE == 1
ompi_spc_dump();
Contributor Author

It looks like I forgot to add the check for whether dumping is enabled. I'll add this when I get home this evening.

@davideberius
Contributor Author

Ok, I've made all of the changes requested. What is the status of the frequency issue?

@bosilca
Member

bosilca commented Jun 9, 2018

I don't think there is an issue anymore. We discussed it with Jeff privately, and I think we agreed that as long as there is a warning we are good to go. But I hope @jsquyres will confirm and then approve this PR so that we can merge.

@jsquyres
Member

I just filed a quick PR (davideberius#1 -- I couldn't seem to push to @davideberius's PR branch for some reason) with some minor suggestions.

In terms of the frequency issue, there's two points:

  1. George pointed out that we're using the invariant frequency, so the change-the-frequency issue shouldn't be an issue. Ok, I'm good with that.
  2. However, that's still only relevant to an individual core. I still wouldn't mind a warning of some kind if we know that you may well get some level of non-determinism in the results (e.g., if your process is not bound to core, and therefore you may see the cycle count from different cores upon subsequent sampling).
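
One possible shape for the binding warning mentioned in the second point, sketched for Linux only: count the CPUs in the process affinity mask and warn when there is more than one. This uses the glibc `sched_getaffinity` interface; the function names and the warning text are hypothetical, and nothing here reflects how Open MPI itself tracks process binding.

```c
#define _GNU_SOURCE   /* for sched_getaffinity() and the CPU_* macros */
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* Return the number of CPUs the calling process may run on, or -1 on error. */
static int sketch_allowed_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (0 != sched_getaffinity(0, sizeof(set), &set)) {
        return -1;
    }
    return CPU_COUNT(&set);
}

/* Print a warning and return 1 if the process is not bound to one CPU. */
static int sketch_warn_if_unbound(FILE *out)
{
    assert(out != NULL);
    int n = sketch_allowed_cpu_count();
    if (n > 1) {
        fprintf(out, "Warning: process may run on %d CPUs; cycle counts "
                     "may come from different cores.\n", n);
        return 1;
    }
    return 0;
}
```

A check like this only detects that migration is possible; it says nothing about whether the counters on different cores are synchronized.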

@bosilca
Member

bosilca commented Jun 10, 2018

@davideberius highlighted that we have the same level of protection for timers as MPI_Wtime. And we inherit the same level of warning.

Member

@thananon thananon left a comment

Looking great. Squash and go.

@davideberius
Contributor Author

Ok, I'll squash this when I get back home tonight. Should this all be one commit?

@bosilca
Member

bosilca commented Jun 11, 2018

Technically there is no strict rule on the number of commits (the details are left to the PR author(s)). In this particular case, due to a significant number of updates, I would indeed squash down to 1.

…ral counters.

This code is the implementation of Software-Based Performance Counters as described in the paper 'Using Software-Based Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf).  More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI.

All software event functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined.  The internal timer units have been changed to cycles to avoid division operations, which were a large source of overhead as discussed in the paper.  Added a --with-spc configure option to enable SPCs in the Open MPI build.  This defines SOFTWARE_EVENTS_ENABLE.  Added an MCA parameter, mpi_spc_enable, for turning on specific counters.  Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize.  Added an SPC test and example.

Signed-off-by: David Eberius <[email protected]>
@davideberius
Contributor Author

I have squashed all of the commits into a single commit.

@jsquyres
Member

bot:mellanox:retest
bot:ompi:retest

Looks like the Jenkins/Github git clone bug again. 😦

Member

@jsquyres jsquyres left a comment

@bosilca Ok, fair enough that we're in the same category as MPI_WTICK/MPI_WTIME, which means that my concerns are not specific to this PR (i.e., we shouldn't hold up this PR any further, and should instead address that issue outside of this PR). I would still like to see some kind of disable-able warning about this ("your time may not be constant!"), but that will be outside of this PR.

@bosilca
Member

bosilca commented Jun 12, 2018

More than what @davideberius pointed to a few comments above?

opal_show_help("help-opal-timer-linux.txt", "monotonic not supported", true);

@jsquyres
Member

@bosilca Yes, more than that (but perhaps similar?) -- opal_sys_timer_is_monotonic() only reports whether the system clock supports monotonic time. It does not take process binding into account (which will be the more likely case: your system supports monotonic time, but your process isn't bound to where it will consistently get the same clock).

Dumb question: is the OMPI time-invariant clock-reading-code getting its values from a core or from a package? (I assume core, but I don't know that for a fact) I.e., if a process is bound to a socket/package, is it still going to always get consistent time?

@bosilca
Member

bosilca commented Jun 12, 2018

My understanding is that it depends on the processor, but I have no idea how to check what processor has what type of monotonic hwclock. I agree that from a user perspective it could be nice to be extremely attentive, but realistically it's a lot of work for little benefit.

@thananon thananon merged commit 390d72a into open-mpi:master Jun 12, 2018